{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create a short text clustering system using AWS SageMaker jumpstart pre-trained transformer models " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Introduction](#Introduction) \n", "2. [Setup](#Setup)\n", "3. [Create a model for text embeddings from the Jumpstart solutions library of models](#Create-a-model-for-text-embeddings-from-the-Jumpstart-solutions-library-of-models)\n", "4. [Data pre-processing](#Data-pre-processing)\n", "5. [Create phrase (sentence) embeddings](#Create-phrase-(sentence)-embeddings)\n", "6. [Cluster phrases (sentences)](#Cluster-phrases-(sentences))\n", "7. [Automatic cluster labeling](#Automatic-cluster-labeling)\n", "8. [Batch process the entire dataset](#Batch-process-the-entire-dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we demonstrate how you can cluster short text (phrases) using the pre-trained transformer models on [AWS SageMaker Jumpstart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). Here we will demonstrate the use of a transformer model called [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli). The model is used to create an embedding of phrases that we will then use to cluster such phrases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by updating the required packages i.e. SageMaker Python SDK, pandas, numpy, etc." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Keyring is skipped due to an exception: 'keyring.backends'\n", "Requirement already satisfied: boto3 in /opt/conda/lib/python3.7/site-packages (1.26.24)\n", "Requirement already satisfied: jsonlines in /opt/conda/lib/python3.7/site-packages (3.1.0)\n", "Requirement already satisfied: seaborn in /opt/conda/lib/python3.7/site-packages (0.10.0)\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from boto3) (0.6.0)\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3) (1.0.1)\n", "Requirement already satisfied: botocore<1.30.0,>=1.29.24 in /opt/conda/lib/python3.7/site-packages (from boto3) (1.29.24)\n", "Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from jsonlines) (4.4.0)\n", "Requirement already satisfied: attrs>=19.2.0 in /opt/conda/lib/python3.7/site-packages (from jsonlines) (22.1.0)\n", "Requirement already satisfied: pandas>=0.22.0 in /opt/conda/lib/python3.7/site-packages (from seaborn) (1.3.5)\n", "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from seaborn) (1.21.6)\n", "Requirement already satisfied: scipy>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from seaborn) (1.4.1)\n", "Requirement already satisfied: matplotlib>=2.1.2 in /opt/conda/lib/python3.7/site-packages (from seaborn) (3.1.3)\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3) (1.26.13)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3) (2.8.2)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in 
/opt/conda/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (1.1.0)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (2.4.6)\n", "Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (0.10.0)\n", "Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2019.3)\n", "Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from cycler>=0.10->matplotlib>=2.1.2->seaborn) (1.14.0)\n", "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (59.3.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" ] } ], "source": [ "!pip install boto3 jsonlines seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Note: Restart the notebook's kernel after installing the above packages.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "import json\n", "import re\n", "import os\n", "\n", "from sagemaker import get_execution_role\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import math\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer \n", "from sklearn.feature_extraction.text import CountVectorizer \n", "\n", "from sklearn.manifold import TSNE\n", "from sklearn.cluster import SpectralClustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use NLTK library to help us with the pre-processing of the data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "session = boto3.Session()\n", "sagemaker_execution_role = get_execution_role()\n", "s3 = session.resource('s3')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a model for text embeddings from the Jumpstart solutions library of models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use one the text embedding [models available in SageMaker jumpstart](https://sagemaker.readthedocs.io/en/v2.129.0/doc_utils/pretrainedmodels.html)" ] }, { "cell_type": "markdown", 
"metadata": { "tags": [] }, "source": [ "#### Chose a model for Inference" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "#We choose the tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1 as the default model since it is better suited for phrase analysis\n", "\n", "model_id, model_version = (\n", " \"tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1\", \n", " \"*\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "You can continue with the default model, or can choose a different model from the dropdown generated upon running the next cell. A complete list of SageMaker pre-trained models can also be accessed at [Sagemaker pre-trained Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [], "source": [ "from ipywidgets import Dropdown\n", "from sagemaker.jumpstart.notebook_utils import list_jumpstart_models, list_jumpstart_tasks\n", "\n", "# Retrieves all text embedding models.\n", "filter_value = \"task == tcembedding\"\n", "tcembedding_models = list_jumpstart_models(filter=filter_value)\n", "\n", "# display the model-ids in a dropdown to select a model for inference.\n", "model_dropdown = Dropdown(\n", " options=tcembedding_models,\n", " value=model_id,\n", " description=\"Select a model\",\n", " style={\"description_width\": \"initial\"},\n", " layout={\"width\": \"max-content\"},\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ae82576e5f51430da1a128adba6a276d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Select a model', index=33, layout=Layout(width='max-content'), options=('mxnet-tcembeddi…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(model_dropdown)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# model_version=\"*\" fetches the latest version of the model\n", "model_id, model_version = model_dropdown.value, \"*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"jumpstart-example-infer-{model_id}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the model from the selected model_id, model_version" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris, hyperparameters\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "inference_instance_type = \"ml.m5.xlarge\" #You can change the instance according to your needs\n", "\n", "# Retrieve the inference docker container uri. This is the base Tensorflow container image for the default model above.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None, # automatically inferred from model_id\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=inference_instance_type,\n", ")\n", "\n", "# Retrieve the inference script uri. 
This includes all dependencies and scripts for model loading, inference handling etc.\n", "deploy_source_uri = script_uris.retrieve(\n", " model_id=model_id, model_version=model_version, script_scope=\"inference\"\n", ")\n", "\n", "\n", "# Retrieve the model uri. This includes the model and model parameters.\n", "model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"inference\"\n", ")\n", "\n", "\n", "# Create the SageMaker model instance\n", "embedding_model = Model(\n", " image_uri=deploy_image_uri,\n", " source_dir=deploy_source_uri,\n", " model_data=model_uri,\n", " entry_point=\"inference.py\", # entry point file in source_dir and present in deploy_source_uri\n", " role=sagemaker_execution_role,\n", " predictor_cls=Predictor,\n", " name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pre-processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this demonstration we will use a dataset made up of the blog titles for each blog published by AWS from 2004 until late 2022. We use the blog titles to cluster them and assign a topic to each cluster\n", "\n", "The text is pre-processed with the following steps:\n", "\n", "* Set category string to lowercase\n", "* Replace acronyms with actual words \n", "* Replace special word-bound characters such as / and - (i.e.: imagenes/videos, cerveza-vino) to get separate words.\n", "* Eliminate explanations between parenthesis\n", "* Remove any other non-word characters from sentence\n", "* Split sentence into tokens\n", "* Singularize each token" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "blogs_df = pd.read_csv('aws_blog_titles.csv', header=None, names=['URL', 'Title'])\n", "\n", "aws_acronyms_df = pd.read_csv('acronyms.csv', header=None, delimiter=';', names=['acronym', 'meaning'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
URLTitle
0https://aws.amazon.com/blogs/aws/10000_sheep_col/10 000 Sheep Collaborative Art Project
1https://aws.amazon.com/blogs/apn/10-best-pract...10 Best Practices to Help Partners Build AWS Q...
2https://aws.amazon.com/blogs/architecture/ten-...10 Things Serverless Architects Should Know
3https://aws.amazon.com/blogs/apn/10-years-of-s...10 Years of Success AWS and Alert Logic
4https://aws.amazon.com/blogs/apn/10-years-of-s...10 Years of Success AWS and Appian
.........
23195https://aws.amazon.com/blogs/media/reinvent-bo...re Invent bonus content M E focused sessions o...
23196https://aws.amazon.com/blogs/opensource/reinve...re Invent open source highlights Week 1
23197https://aws.amazon.com/blogs/startups/redbus-b...redBus Building a Data Platform with AWS Apach...
23198https://aws.amazon.com/blogs/security/s2n-and-...s2n and Lucky 13
23199https://aws.amazon.com/blogs/startups/wefoxgro...wefoxgroup s Migration to AWS and Amazon EKS
\n", "

23200 rows × 2 columns

\n", "
" ], "text/plain": [ " URL \\\n", "0 https://aws.amazon.com/blogs/aws/10000_sheep_col/ \n", "1 https://aws.amazon.com/blogs/apn/10-best-pract... \n", "2 https://aws.amazon.com/blogs/architecture/ten-... \n", "3 https://aws.amazon.com/blogs/apn/10-years-of-s... \n", "4 https://aws.amazon.com/blogs/apn/10-years-of-s... \n", "... ... \n", "23195 https://aws.amazon.com/blogs/media/reinvent-bo... \n", "23196 https://aws.amazon.com/blogs/opensource/reinve... \n", "23197 https://aws.amazon.com/blogs/startups/redbus-b... \n", "23198 https://aws.amazon.com/blogs/security/s2n-and-... \n", "23199 https://aws.amazon.com/blogs/startups/wefoxgro... \n", "\n", " Title \n", "0 10 000 Sheep Collaborative Art Project \n", "1 10 Best Practices to Help Partners Build AWS Q... \n", "2 10 Things Serverless Architects Should Know \n", "3 10 Years of Success AWS and Alert Logic \n", "4 10 Years of Success AWS and Appian \n", "... ... \n", "23195 re Invent bonus content M E focused sessions o... \n", "23196 re Invent open source highlights Week 1 \n", "23197 redBus Building a Data Platform with AWS Apach... \n", "23198 s2n and Lucky 13 \n", "23199 wefoxgroup s Migration to AWS and Amazon EKS \n", "\n", "[23200 rows x 2 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blogs_df" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
acronymmeaning
0Amazon ESAmazon Elasticsearch Service
1AMIAmazon Machine Image
2APIApplication Programming Interface
3AIArtificial Intelligence
4ACLAccess Control List
.........
120VPNVirtual Private Network
121VLANVirtual Local Area Network
122VDIVirtual Desktop Infrastructure
123VPGVirtual Private Gateway
124WAFWeb Application Firewall
\n", "

125 rows × 2 columns

\n", "
" ], "text/plain": [ " acronym meaning\n", "0 Amazon ES Amazon Elasticsearch Service\n", "1 AMI Amazon Machine Image\n", "2 API Application Programming Interface\n", "3 AI Artificial Intelligence\n", "4 ACL Access Control List\n", ".. ... ...\n", "120 VPN Virtual Private Network\n", "121 VLAN Virtual Local Area Network\n", "122 VDI Virtual Desktop Infrastructure\n", "123 VPG Virtual Private Gateway\n", "124 WAF Web Application Firewall\n", "\n", "[125 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We transform acronyms to their actual meaning since the transformer may not be aware of them (as it was not trained in this specific vocabulary)\n", "\n", "aws_acronyms_df" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "blogs_df = blogs_df.drop_duplicates(subset=['Title']).reset_index()\n", "blogs_df = blogs_df.drop(columns=['index', 'URL'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "#For eficiency only take sample_size titles at random\n", "\n", "sample_size = 1000\n", "blogs_df_sample = blogs_df.sample(n=sample_size)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "titles = blogs_df_sample['Title'].tolist()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "lemmatized = []\n", "for title in titles:\n", " sentence = title.lower()\n", " sentence = re.sub(r'[^a-zA-Z0-9_-áéíóúñ ]', r'', sentence) #remove extraneous characters (maybe a different encoding)\n", " lemmatized.append(sentence)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['nova brings data science courses to the usmc with aws educate',\n", " 'cloud native application monitoring for aws',\n", " 'sharing matlab applications on aws using the matlab web app server',\n", " 'amazon elastic file system shared file storage for amazon ec2',\n", " 'duplicating infrastructure on aws',\n", " 'amazon kinesis analytics process streaming data in real time with sql',\n", " 'how to enable 360 degree analytics and innovate faster on aws with datavard glue for sap',\n", " 'amazon monitron a simple and cost effective service enabling predictive maintenance',\n", " 'new aws partner program launches and updates announced at re invent 2020',\n", " 'online tech talk april 23 persistent storage for containers with amazon efs']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lemmatized[:10]" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Create phrase (sentence) embeddings" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# These functions are used to query the endpoint and parse the response\n", "\n", "def query(model_predictor, text):\n", " \"\"\"Query the model predictor.\"\"\"\n", "\n", " encoded_text = json.dumps(text).encode(\"utf-8\")\n", "\n", " query_response = model_predictor.predict(\n", " encoded_text,\n", " {\n", " \"ContentType\": \"application/x-text\",\n", " \"Accept\": \"application/json\",\n", " },\n", " )\n", " return query_response\n", "\n", "\n", "def parse_response(query_response):\n", " \"\"\"Parse response and return the embedding.\"\"\"\n", "\n", " model_predictions = json.loads(query_response)\n", " embedding = model_predictions[\"embedding\"]\n", " return embedding" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Deploy the selected model to an endpoint for real time inference" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-----!" ] } ], "source": [ "# deploy the Model. Note that we need to pass Predictor class when we deploy model through Model class,\n", "# for being able to run inference through the sagemaker API.\n", "model_predictor = embedding_model.deploy(\n", " initial_instance_count=1,\n", " instance_type=inference_instance_type,\n", " predictor_cls=Predictor,\n", " endpoint_name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the deployed model to generate the embeddings for each of the titles in our sample dataset" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "#model_predictor = Predictor('jumpstart-example-infer-tensorflow-tcem-2023-01-19-23-23-44-619') #Specifiy endpoint name in case you wanna use an already deployed endpoint" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.37 s, sys: 142 ms, total: 2.52 s\n", "Wall time: 4min 59s\n" ] } ], "source": [ "%%time\n", "sentence_vectors = [parse_response(query(model_predictor, title)) for title in lemmatized]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "encoded_titles_df = pd.DataFrame(sentence_vectors)\n", "encoded_titles_df['blog_title_lemmatized'] = lemmatized" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...101510161017101810191020102110221023blog_title_lemmatized
0-1.2592440.1021270.089923-0.820289-0.223537-0.4328970.799307-0.674757-0.252520-0.272772...0.141702-0.3929420.119546-0.386988-0.5684770.205832-0.170402-1.0257660.198024nova brings data science courses to the usmc w...
1-0.6929620.259012-0.748741-0.866656-0.901526-0.9977680.377633-0.053566-0.5037520.144736...0.1141720.5220530.834067-0.771977-0.4666120.195516-0.7157550.326673-0.530868cloud native application monitoring for aws
2-0.6527420.1266230.094145-0.545012-0.761059-0.479529-0.1198980.001302-0.3473510.485530...0.158906-0.3141140.502422-0.112350-0.143338-0.790585-0.104163-0.003831-0.414520sharing matlab applications on aws using the m...
30.157903-1.218128-0.358249-0.143645-0.7859350.1116890.451628-0.0401830.2325290.813196...0.574929-0.5959770.7333600.2285050.4845460.407535-0.6560690.4651970.631824amazon elastic file system shared file storage...
4-0.1997680.498280-0.096403-1.235060-0.164219-0.747414-0.2802070.717804-1.2013740.192102...-0.4228360.0487280.729825-0.9742790.4520090.646301-0.7311220.526780-0.412448duplicating infrastructure on aws
..................................................................
995-0.1905240.1905220.247199-0.909789-0.532487-0.958266-0.366771-0.469838-0.685504-0.749760...-0.4820150.3583460.935664-0.9442320.3747910.1388230.717717-0.5817090.456586aws week in review august 25 2014
996-0.524172-0.5085150.474869-1.093429-1.255689-0.6884240.8987230.750922-1.1547381.334882...0.765626-0.2403150.339106-0.1428250.226154-0.350119-0.670260-0.665765-0.013468optimizing performance for users in china with...
997-0.4047310.581385-0.491551-0.588944-0.035981-0.5298260.4599390.522350-0.7747180.081250...0.718470-0.0133231.003814-0.764670-0.3642410.231567-0.153268-0.106582-1.071888aws accelerator for citrix migrate or deploy x...
9980.3075070.6467970.672350-0.879446-0.829220-0.1739030.130434-0.2285250.172912-0.866439...-0.857214-0.3249041.445891-0.198214-0.1046840.646890-0.060024-0.3841430.098519in case you missed it september 2019 top blog ...
9990.926491-0.4433660.707322-0.335114-1.174717-0.1235560.9198390.6483310.875201-0.280192...0.192803-0.1631930.983904-0.5195470.6765460.539295-0.2934650.3788260.312693why avatars are usually awful and how snappr f...
\n", "

1000 rows × 1025 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 -1.259244 0.102127 0.089923 -0.820289 -0.223537 -0.432897 0.799307 \n", "1 -0.692962 0.259012 -0.748741 -0.866656 -0.901526 -0.997768 0.377633 \n", "2 -0.652742 0.126623 0.094145 -0.545012 -0.761059 -0.479529 -0.119898 \n", "3 0.157903 -1.218128 -0.358249 -0.143645 -0.785935 0.111689 0.451628 \n", "4 -0.199768 0.498280 -0.096403 -1.235060 -0.164219 -0.747414 -0.280207 \n", ".. ... ... ... ... ... ... ... \n", "995 -0.190524 0.190522 0.247199 -0.909789 -0.532487 -0.958266 -0.366771 \n", "996 -0.524172 -0.508515 0.474869 -1.093429 -1.255689 -0.688424 0.898723 \n", "997 -0.404731 0.581385 -0.491551 -0.588944 -0.035981 -0.529826 0.459939 \n", "998 0.307507 0.646797 0.672350 -0.879446 -0.829220 -0.173903 0.130434 \n", "999 0.926491 -0.443366 0.707322 -0.335114 -1.174717 -0.123556 0.919839 \n", "\n", " 7 8 9 ... 1015 1016 1017 \\\n", "0 -0.674757 -0.252520 -0.272772 ... 0.141702 -0.392942 0.119546 \n", "1 -0.053566 -0.503752 0.144736 ... 0.114172 0.522053 0.834067 \n", "2 0.001302 -0.347351 0.485530 ... 0.158906 -0.314114 0.502422 \n", "3 -0.040183 0.232529 0.813196 ... 0.574929 -0.595977 0.733360 \n", "4 0.717804 -1.201374 0.192102 ... -0.422836 0.048728 0.729825 \n", ".. ... ... ... ... ... ... ... \n", "995 -0.469838 -0.685504 -0.749760 ... -0.482015 0.358346 0.935664 \n", "996 0.750922 -1.154738 1.334882 ... 0.765626 -0.240315 0.339106 \n", "997 0.522350 -0.774718 0.081250 ... 0.718470 -0.013323 1.003814 \n", "998 -0.228525 0.172912 -0.866439 ... -0.857214 -0.324904 1.445891 \n", "999 0.648331 0.875201 -0.280192 ... 0.192803 -0.163193 0.983904 \n", "\n", " 1018 1019 1020 1021 1022 1023 \\\n", "0 -0.386988 -0.568477 0.205832 -0.170402 -1.025766 0.198024 \n", "1 -0.771977 -0.466612 0.195516 -0.715755 0.326673 -0.530868 \n", "2 -0.112350 -0.143338 -0.790585 -0.104163 -0.003831 -0.414520 \n", "3 0.228505 0.484546 0.407535 -0.656069 0.465197 0.631824 \n", "4 -0.974279 0.452009 0.646301 -0.731122 0.526780 -0.412448 \n", ".. ... ... ... ... ... ... \n", "995 -0.944232 0.374791 0.138823 0.717717 -0.581709 0.456586 \n", "996 -0.142825 0.226154 -0.350119 -0.670260 -0.665765 -0.013468 \n", "997 -0.764670 -0.364241 0.231567 -0.153268 -0.106582 -1.071888 \n", "998 -0.198214 -0.104684 0.646890 -0.060024 -0.384143 0.098519 \n", "999 -0.519547 0.676546 0.539295 -0.293465 0.378826 0.312693 \n", "\n", " blog_title_lemmatized \n", "0 nova brings data science courses to the usmc w... \n", "1 cloud native application monitoring for aws \n", "2 sharing matlab applications on aws using the m... \n", "3 amazon elastic file system shared file storage... \n", "4 duplicating infrastructure on aws \n", ".. ... \n", "995 aws week in review august 25 2014 \n", "996 optimizing performance for users in china with... \n", "997 aws accelerator for citrix migrate or deploy x... \n", "998 in case you missed it september 2019 top blog ... \n", "999 why avatars are usually awful and how snappr f... \n", "\n", "[1000 rows x 1025 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded_titles_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster phrases (sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spectral clsutering is a clustering algorithm based on graph theory. Spectral clustering uses information from the eigenvalues (spectrum) of the Laplacian matrix built from the graph or the data set to create groups (clusters) of data. 
Spectral clustering requires a measure of affinity between data points; for this application we use cosine affinity because we are interested in sentences that lie near each other and therefore have \"similar meaning\"." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "n_clusters = 20\n", "\n", "clustering_model = SpectralClustering(n_clusters=n_clusters, n_init=100, affinity='cosine', n_neighbors=10, assign_labels=\"kmeans\", random_state=0)\n", "embeddings = encoded_titles_df[encoded_titles_df.columns[0:-1]]\n", "encoded_titles_df['cluster'] = clustering_model.fit_predict(embeddings)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "clusters_titles = encoded_titles_df[['blog_title_lemmatized', 'cluster']]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster
0nova brings data science courses to the usmc w...9
1cloud native application monitoring for aws17
2sharing matlab applications on aws using the m...0
3amazon elastic file system shared file storage...5
4duplicating infrastructure on aws17
.........
995aws week in review august 25 20146
996optimizing performance for users in china with...8
997aws accelerator for citrix migrate or deploy x...17
998in case you missed it september 2019 top blog ...6
999why avatars are usually awful and how snappr f...9
\n", "

1000 rows × 2 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster\n", "0 nova brings data science courses to the usmc w... 9\n", "1 cloud native application monitoring for aws 17\n", "2 sharing matlab applications on aws using the m... 0\n", "3 amazon elastic file system shared file storage... 5\n", "4 duplicating infrastructure on aws 17\n", ".. ... ...\n", "995 aws week in review august 25 2014 6\n", "996 optimizing performance for users in china with... 8\n", "997 aws accelerator for citrix migrate or deploy x... 17\n", "998 in case you missed it september 2019 top blog ... 6\n", "999 why avatars are usually awful and how snappr f... 9\n", "\n", "[1000 rows x 2 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_titles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic cluster labeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use TF-IDF for finding the keywords in each of our clusters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text Frequency - Inverse Document Frequency is an NLP technique used to find the most relevant terms in set of documents (phrases in our case). From each cluster we extract its most relevant terms (nouns only) according to TF-IDF and use those as labels/categories for that cluster" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "clusters = [clusters_titles.loc[clusters_titles.cluster == i, 'blog_title_lemmatized'].to_list() for i in range(0,n_clusters)]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "clusters_tf_idf = []\n", "clusters_tf_idf_terms = []\n", "clusters_tags = []\n", "clusters_keywords_tf_idf = []\n", "tf_idf_threshold = 0.2\n", "\n", "for cluster in clusters:\n", "\n", " tfIdfVectorizer = TfidfVectorizer(use_idf=True)\n", " tfIdf = tfIdfVectorizer.fit_transform(cluster)\n", " tf_idf_df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=[\"TF-IDF\"])\n", " tf_idf_df = tf_idf_df.sort_values('TF-IDF', ascending=False)\n", " \n", " clusters_tf_idf.append(tf_idf_df)\n", " \n", " cluster_tf_idf_terms = list(tf_idf_df.loc[tf_idf_df['TF-IDF'] > tf_idf_threshold].index.values)\n", " clusters_tf_idf_terms.append(cluster_tf_idf_terms)\n", " \n", " tags = nltk.pos_tag(cluster_tf_idf_terms)\n", " clusters_tags.append(tags)\n", " \n", " keywords = [tag[0] for tag in tags if tag[1] in ['NN', 'NNS'] and tag[0] not in ['aws', 'amazon']]\n", " clusters_keywords_tf_idf.append(keywords)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "clusters_idf = []\n", "\n", "for cluster in clusters:\n", "\n", " cv = CountVectorizer() \n", " word_count_vector = cv.fit_transform(cluster)\n", "\n", " tfidf_transformer = TfidfTransformer(smooth_idf=True,use_idf=True) \n", " tfidf_transformer.fit(word_count_vector)\n", "\n", " df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=[\"idf_weights\"])\n", " df_idf['word'] = cv.get_feature_names()\n", "\n", " df_idf = df_idf.sort_values(by='idf_weights')\n", " clusters_idf.append(df_idf)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['matlab', 'applications', 'app'],\n", " ['gen', 'adopters', 'problems'],\n", " ['model', 'register', 'security'],\n", " ['tool', 'risks', 'manage', 'systems'],\n", " ['messaging'],\n", " ['file', 'storage', 'system'],\n", " 
['september', 'week', 'review'],\n", " ['connectivity', 'options', 'gigabit'],\n", " ['quality', 'call', 'connect', 'detection', 'time', 'opensearch'],\n", " ['educate', 'nova', 'brings', 'courses', 'science', 'data'],\n", " ['service', 'monitron', 'simple', 'maintenance', 'cost'],\n", " ['service', 'database'],\n", " ['uploads', 's3'],\n", " ['process', 'streaming', 'time', 'analytics', 'kinesis'],\n", " ['reports', 'soc'],\n", " ['sydney', 'university', 'technology', 'stroke', 'rehabilitation', 'robots'],\n", " ['jenkins', 'party', 'create', 'source', 'projects', 'control'],\n", " ['application', 'monitoring', 'cloud'],\n", " ['efs', 'talk', 'tech', 'online', 'containers', 'storage'],\n", " ['program', 'updates', 'partner']]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_keywords_tf_idf" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idf_weightsword
aws1.117783aws
the1.492476the
for1.750306for
sdk1.944462sdk
with1.944462with
and2.098612and
to2.386294to
using2.504077using
android2.791759android
on2.791759on
in2.791759in
an2.791759an
console2.974081console
app2.974081app
application2.974081application
amazon2.974081amazon
new2.974081new
open3.197225open
release3.197225release
build3.197225build
\n", "
" ], "text/plain": [ " idf_weights word\n", "aws 1.117783 aws\n", "the 1.492476 the\n", "for 1.750306 for\n", "sdk 1.944462 sdk\n", "with 1.944462 with\n", "and 2.098612 and\n", "to 2.386294 to\n", "using 2.504077 using\n", "android 2.791759 android\n", "on 2.791759 on\n", "in 2.791759 in\n", "an 2.791759 an\n", "console 2.974081 console\n", "app 2.974081 app\n", "application 2.974081 application\n", "amazon 2.974081 amazon\n", "new 2.974081 new\n", "open 3.197225 open\n", "release 3.197225 release\n", "build 3.197225 build" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_idf[0][:20]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "clusters_keywords_idf = []\n", "stop_words = stopwords.words('english')\n", "\n", "clusters_idf_words = [cluster['word'].to_list()[0:10] for cluster in clusters_idf]\n", "s = [ word for word in stop_words if word != 're'] #Remove stopwords but the word re (for re: invent)\n", "\n", "for cluster in clusters_idf_words:\n", " tags = nltk.pos_tag(cluster)\n", " words = [ tag[0] for tag in tags if tag[1] in ['NN', 'NNS'] and tag[0] not in ['aws', 'amazon']]\n", " \n", " clusters_keywords_idf.append(\",\".join(words))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sdk,android',\n", " '',\n", " 'security,summit',\n", " 'manager,systems,management',\n", " '',\n", " 'instances,instance',\n", " 'review,week,september,part',\n", " 'partners',\n", " 'service,rekognition',\n", " 'data',\n", " 'cloud,data,management',\n", " 'rds,database',\n", " 'cloudformation,update',\n", " 'data,redshift,dynamodb,s3',\n", " 'source,time',\n", " '',\n", " 'sagemaker,model,inference,data',\n", " 'workloads',\n", " '',\n", " 're,invent,guide']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_keywords_idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch process the entire dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will create batch processing jobs to process the entire dataset (roughly 24K titles)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "import io\n", "import jsonlines\n", "\n", "from sagemaker.s3 import S3Downloader,S3Uploader,s3_path_join\n", "\n", "n_clusters = 20" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "bucket_name = 'unsupervised-phrase-clustering-with-sagemaker' #\n", "s3_prefix = 'text-clustering-with-transformers/data'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Chose a model for Inference" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "#We choose the tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1 as the default model since it is better suited for phrase analysis\n", "\n", "model_id, model_version = (\n", " \"tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1\", \n", " \"*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can continue with the default model, or can choose a different model from the dropdown generated upon running the next cell. A complete list of SageMaker pre-trained models can also be accessed at [Sagemaker pre-trained Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the model from the selected model_id, model_version" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"jumpstart-example-infer-gpu-{model_id}\")" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "batch_transform_instance_type = \"ml.g4dn.xlarge\"\n", "\n", "# Retrieve the inference docker container uri. This is the base Tensorflow container image for the default model above.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None, # automatically inferred from model_id\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=batch_transform_instance_type,\n", ")\n", "\n", "# Retrieve the inference script uri. This includes all dependencies and scripts for model loading, inference handling etc.\n", "deploy_source_uri = script_uris.retrieve(\n", " model_id=model_id, model_version=model_version, script_scope=\"inference\"\n", ")\n", "\n", "\n", "# Retrieve the model uri. This includes the model and model parameters.\n", "model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"inference\"\n", ")\n", "\n", "\n", "# Create the SageMaker model instance\n", "batch_transform_embedding_model = Model(\n", " image_uri=deploy_image_uri,\n", " source_dir=deploy_source_uri,\n", " model_data=model_uri,\n", " entry_point=\"inference.py\", # entry point file in source_dir and present in deploy_source_uri\n", " role=sagemaker_execution_role,\n", " name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data preprocessing" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "blogs_df = pd.read_csv('aws_blog_titles.csv', header=None, names=['URL', 'Title'])\n", "aws_acronyms_df = pd.read_csv('acronyms.csv', header=None, delimiter=';', names=['acronym', 'meaning'])\n", "blogs_df = blogs_df.drop_duplicates(subset=['Title']).reset_index()\n", "blogs_df = blogs_df.drop(columns=['index', 'URL'])\n", "titles = blogs_df['Title'].tolist()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "lemmatized = []\n", "for title in titles:\n", " sentence = title.lower()\n", " sentence = re.sub(r'[^a-zA-Z0-9_-áéíóúñ ]', r'', sentence) #remove extraneous characters (maybe a different encoding)\n", " lemmatized.append(sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the pre-processed data to S3" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "batch_filename = 'aws_blog_titles.jsonl'" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "with open(batch_filename, \"wb\") as txt_file:\n", " for title in lemmatized:\n", " \n", " txt_file.write(json.dumps(title).encode(\"utf-8\"))\n", " txt_file.write(\"\\n\".encode('utf-8'))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading data to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/raw\n", 
"Uploaded data to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/raw\n" ] } ], "source": [ "data_upload_path = s3_path_join(\"s3://\",bucket_name,s3_prefix, 'raw')\n", "print(f\"Uploading data to {data_upload_path}\")\n", "data_uri = S3Uploader.upload(batch_filename, data_upload_path)\n", "print(f\"Uploaded data to {data_upload_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate embeddings" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "# create transformer to run a batch job\n", "\n", "output_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"embeddings\")\n", "\n", "batch_job = batch_transform_embedding_model.transformer(\n", " instance_count=1,\n", " instance_type=batch_transform_instance_type,\n", " strategy='SingleRecord',\n", " assemble_with='Line',\n", " output_path=output_path,\n", ")" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# Starts batch transform job and uses S3 data as input. Enable the logs and wait only if you pass a small number of samples (< 100).\n", "# You can monitor your batch processing job from the SageMaker Console -> Inference -> Batch transform jobs\n", "batch_job.transform(\n", " data=data_upload_path,\n", " content_type='application/x-text', \n", " split_type='Line',\n", " logs=False,\n", " wait=False\n", ")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading embeddings to .\n", "Downloaded embeddings to .\n" ] } ], "source": [ "#Download the results. \n", "#The batch transformation job (step above) must have finished before you can run this cell.\n", "embedding_data_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"embeddings\", batch_filename+'.out')\n", "print(f\"Downloading embeddings to .\")\n", "S3Downloader.download(embedding_data_path,'.')\n", "print(f\"Downloaded embeddings to .\")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "lines = []\n", "\n", "with jsonlines.open(batch_filename+\".out\", mode='r') as reader:\n", " for obj in reader:\n", " lines.append(obj['embedding'])\n", " \n", "results_df = pd.DataFrame(lines)\n", "results_df['blog_title_lemmatized'] = lemmatized" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...101510161017101810191020102110221023blog_title_lemmatized
00.188292-0.390401-0.286212-0.732304-0.841209-1.088366-0.5982131.0238980.821023-1.058815...-0.1759970.2458660.7061020.783085-0.8235471.258304-0.043126-0.472476-0.35132410 000 sheep collaborative art project
1-0.264925-0.1083070.688968-0.680849-0.351448-0.8992820.391024-0.6357010.246767-0.395383...0.6737140.1676501.4761310.409790-0.0500080.5825450.279141-0.205470-0.82035810 best practices to help partners build aws q...
20.3718850.7713791.069450-1.0083370.655574-0.955685-0.7113910.6254570.369845-0.443280...-0.096941-0.0258790.967100-0.2833450.6914510.8821240.534032-0.7393880.37805310 things serverless architects should know
3-0.3013700.8477130.486485-0.280148-1.502466-1.012401-0.0910170.317997-0.872763-0.809212...-0.3925430.0505540.026191-0.436454-0.3203720.507621-0.305170-0.284728-0.77644110 years of success aws and alert logic
40.0298710.2951200.633014-0.412160-1.094922-0.7949950.185795-0.116586-0.548522-1.379145...0.5486480.0109770.174071-0.338245-0.2988920.5998600.190257-0.457519-1.00318710 years of success aws and appian
..................................................................
23125-0.0229621.2567890.323558-0.1997240.1790750.2199970.139460-0.184126-0.289477-0.732770...-0.7509400.3332120.500258-0.642244-0.2510671.796083-0.214922-0.485310-0.257823re invent bonus content m e focused sessions o...
231260.7004330.9530941.036029-0.781870-0.111338-0.4784880.362508-0.4221370.382570-0.086834...-0.2690350.2573870.684636-0.135962-0.2679560.802172-0.122769-0.1623950.466416re invent open source highlights week 1
231270.0965530.9534120.493536-1.0500310.031639-0.762166-0.2219620.420519-0.328862-0.170843...-0.7291300.5186890.3604140.419243-0.2445980.524628-0.373507-0.994924-0.055463redbus building a data platform with aws apach...
23128-0.662509-0.0926240.064623-0.446625-0.261067-0.4373870.7017770.340038-0.293806-0.226297...0.8819360.4503750.2891840.1228120.0635640.710250-0.277890-1.0997300.753118s2n and lucky 13
23129-0.542878-0.2376170.501378-1.1801790.353943-0.340788-0.5389090.480047-0.763953-0.598589...-0.372877-0.1954860.8644770.186859-1.1272511.057047-0.2116750.377588-0.937336wefoxgroup s migration to aws and amazon eks
\n", "

23130 rows × 1025 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 0.188292 -0.390401 -0.286212 -0.732304 -0.841209 -1.088366 -0.598213 \n", "1 -0.264925 -0.108307 0.688968 -0.680849 -0.351448 -0.899282 0.391024 \n", "2 0.371885 0.771379 1.069450 -1.008337 0.655574 -0.955685 -0.711391 \n", "3 -0.301370 0.847713 0.486485 -0.280148 -1.502466 -1.012401 -0.091017 \n", "4 0.029871 0.295120 0.633014 -0.412160 -1.094922 -0.794995 0.185795 \n", "... ... ... ... ... ... ... ... \n", "23125 -0.022962 1.256789 0.323558 -0.199724 0.179075 0.219997 0.139460 \n", "23126 0.700433 0.953094 1.036029 -0.781870 -0.111338 -0.478488 0.362508 \n", "23127 0.096553 0.953412 0.493536 -1.050031 0.031639 -0.762166 -0.221962 \n", "23128 -0.662509 -0.092624 0.064623 -0.446625 -0.261067 -0.437387 0.701777 \n", "23129 -0.542878 -0.237617 0.501378 -1.180179 0.353943 -0.340788 -0.538909 \n", "\n", " 7 8 9 ... 1015 1016 1017 \\\n", "0 1.023898 0.821023 -1.058815 ... -0.175997 0.245866 0.706102 \n", "1 -0.635701 0.246767 -0.395383 ... 0.673714 0.167650 1.476131 \n", "2 0.625457 0.369845 -0.443280 ... -0.096941 -0.025879 0.967100 \n", "3 0.317997 -0.872763 -0.809212 ... -0.392543 0.050554 0.026191 \n", "4 -0.116586 -0.548522 -1.379145 ... 0.548648 0.010977 0.174071 \n", "... ... ... ... ... ... ... ... \n", "23125 -0.184126 -0.289477 -0.732770 ... -0.750940 0.333212 0.500258 \n", "23126 -0.422137 0.382570 -0.086834 ... -0.269035 0.257387 0.684636 \n", "23127 0.420519 -0.328862 -0.170843 ... -0.729130 0.518689 0.360414 \n", "23128 0.340038 -0.293806 -0.226297 ... 0.881936 0.450375 0.289184 \n", "23129 0.480047 -0.763953 -0.598589 ... -0.372877 -0.195486 0.864477 \n", "\n", " 1018 1019 1020 1021 1022 1023 \\\n", "0 0.783085 -0.823547 1.258304 -0.043126 -0.472476 -0.351324 \n", "1 0.409790 -0.050008 0.582545 0.279141 -0.205470 -0.820358 \n", "2 -0.283345 0.691451 0.882124 0.534032 -0.739388 0.378053 \n", "3 -0.436454 -0.320372 0.507621 -0.305170 -0.284728 -0.776441 \n", "4 -0.338245 -0.298892 0.599860 0.190257 -0.457519 -1.003187 \n", "... ... ... ... ... ... ... \n", "23125 -0.642244 -0.251067 1.796083 -0.214922 -0.485310 -0.257823 \n", "23126 -0.135962 -0.267956 0.802172 -0.122769 -0.162395 0.466416 \n", "23127 0.419243 -0.244598 0.524628 -0.373507 -0.994924 -0.055463 \n", "23128 0.122812 0.063564 0.710250 -0.277890 -1.099730 0.753118 \n", "23129 0.186859 -1.127251 1.057047 -0.211675 0.377588 -0.937336 \n", "\n", " blog_title_lemmatized \n", "0 10 000 sheep collaborative art project \n", "1 10 best practices to help partners build aws q... \n", "2 10 things serverless architects should know \n", "3 10 years of success aws and alert logic \n", "4 10 years of success aws and appian \n", "... ... \n", "23125 re invent bonus content m e focused sessions o... \n", "23126 re invent open source highlights week 1 \n", "23127 redbus building a data platform with aws apach... 
\n", "23128 s2n and lucky 13 \n", "23129 wefoxgroup s migration to aws and amazon eks \n", "\n", "[23130 rows x 1025 columns]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_df" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "embeddings_filename = \"blog_title_embeddings.csv\"\n", "results_df.to_csv(embeddings_filename, index=False)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading embeddings to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/embeddings\n", "Uploaded embeddings to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/embeddings\n" ] } ], "source": [ "embedding_upload_path = s3_path_join(\"s3://\",bucket_name,s3_prefix, 'embeddings')\n", "print(f\"Uploading embeddings to {embedding_upload_path}\")\n", "data_uri = S3Uploader.upload(embeddings_filename, embedding_upload_path)\n", "print(f\"Uploaded embeddings to {embedding_upload_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cluster titles" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "sklearn_processor_spectral_clustering = SKLearnProcessor(framework_version='1.0-1',\n", " role=sagemaker_execution_role,\n", " instance_type='ml.m5.2xlarge',\n", " instance_count=1)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Job Name: sagemaker-scikit-learn-2023-03-14-02-20-48-009\n", "Inputs: [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/embeddings', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-230294632802/sagemaker-scikit-learn-2023-03-14-02-20-48-009/input/code/SpectralClustering.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]\n", "Outputs: [{'OutputName': 'titles_clusters', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/results/clusters', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]\n", "..........................\u001b[34mArgs:Namespace(affinity='cosine', assign_labels='kmeans', n_clusters=20, n_init=100, n_neighbors=10, random_state=None)\u001b[0m\n", "\u001b[34mLoading embeddings\u001b[0m\n", "\u001b[34mPerforming clustering\u001b[0m\n", "\u001b[34mSaving clusters!\u001b[0m\n", "\n" ] } ], "source": [ "output_destination = os.path.join('s3://', bucket_name, s3_prefix, \"results\", \"clusters\")\n", "\n", "sklearn_processor_spectral_clustering.run(\n", " code=\"./scikit-sagemaker-clustering/SpectralClustering.py\",\n", " inputs=[ProcessingInput(source=embedding_upload_path, 
destination=\"/opt/ml/processing/input\")],\n", " outputs=[ProcessingOutput(output_name=\"titles_clusters\", source=\"/opt/ml/processing/output\", destination=output_destination)],\n", " arguments=[\"--n-clusters\", str(n_clusters),\n", " \"--n-init\", \"100\",\n", " \"--affinity\", \"cosine\",\n", " \"--n-neighbors\", \"10\",\n", " \"--assign-labels\", \"kmeans\"\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading cluster data to .\n", "Downloaded cluster data to .\n" ] } ], "source": [ "#Download the results. \n", "#The batch clustering job (step above) must have finished before you can run this cell.\n", "\n", "clusters_file = 'clustered_blog_titles_with_embeddings.csv'\n", "\n", "clusters_data_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"clusters\", clusters_file)\n", "print(f\"Downloading cluster data to .\")\n", "S3Downloader.download(clusters_data_path,'.')\n", "print(f\"Downloaded cluster data to .\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatic cluster labeling" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "clusters_df = pd.read_csv(clusters_file)\n", "clusters_titles = clusters_df[['blog_title_lemmatized', 'cluster_label']]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster_label
010 000 sheep collaborative art project19
110 best practices to help partners build aws q...2
210 things serverless architects should know13
310 years of success aws and alert logic18
410 years of success aws and appian18
.........
23125re invent bonus content m e focused sessions o...15
23126re invent open source highlights week 115
23127redbus building a data platform with aws apach...17
23128s2n and lucky 1319
23129wefoxgroup s migration to aws and amazon eks18
\n", "

23130 rows × 2 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster_label\n", "0 10 000 sheep collaborative art project 19\n", "1 10 best practices to help partners build aws q... 2\n", "2 10 things serverless architects should know 13\n", "3 10 years of success aws and alert logic 18\n", "4 10 years of success aws and appian 18\n", "... ... ...\n", "23125 re invent bonus content m e focused sessions o... 15\n", "23126 re invent open source highlights week 1 15\n", "23127 redbus building a data platform with aws apach... 17\n", "23128 s2n and lucky 13 19\n", "23129 wefoxgroup s migration to aws and amazon eks 18\n", "\n", "[23130 rows x 2 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_titles" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "clusters = [clusters_titles.loc[clusters_titles.cluster_label == i, 'blog_title_lemmatized'].to_list() for i in range(0, n_clusters)]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "clusters_tf_idf = []\n", "clusters_tf_idf_terms = []\n", "clusters_tags = []\n", "clusters_keywords_tf_idf = []\n", "tf_idf_threshold = 0.2\n", "\n", "for cluster in clusters:\n", "\n", " tfIdfVectorizer = TfidfVectorizer(use_idf=True)\n", " tfIdf = tfIdfVectorizer.fit_transform(cluster)\n", " tf_idf_df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=[\"TF-IDF\"])\n", " tf_idf_df = tf_idf_df.sort_values('TF-IDF', ascending=False)\n", " \n", " clusters_tf_idf.append(tf_idf_df)\n", " \n", " cluster_tf_idf_terms = list(tf_idf_df.loc[tf_idf_df['TF-IDF'] > tf_idf_threshold].index.values)\n", " clusters_tf_idf_terms.append(cluster_tf_idf_terms)\n", " \n", " tags = nltk.pos_tag(cluster_tf_idf_terms)\n", " clusters_tags.append(tags)\n", " \n", " keywords = [tag[0] for tag in tags if tag[1] in ['NN', 'NNS'] and tag[0] not in ['aws', 'amazon']]\n", " clusters_keywords_tf_idf.append(keywords)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['blockchain', 'datasets', 'data'],\n", " ['alces', 'flight', 'series', 'sector', 'university', 'power', 'research'],\n", " ['starts', 'practices', 'help', 'customers'],\n", " ['visualizations', 'data', 'quicksight'],\n", " ['success', 'years'],\n", " ['turbines', 'drone', 'wind', 'inspection', 'ai', 'driven'],\n", " ['simpler', 'sam', 'experience', 'deployment', 'cli'],\n", " ['feedback', 'whitepapers', 'videos', 'articles', 'year'],\n", " ['backup', 'ways', 'plans', 'rules'],\n", " ['patient', 'health', 'part', 'building', 'digital'],\n", " ['innovation', 'years'],\n", " ['switzerland', 'isae', 'finma', 'attestation', 'type', 'report'],\n", " ['experience', 'style', 'shop', 'pytorch'],\n", " ['architects', 'things'],\n", " ['review', 'year', 'mongodb', 'compatibility'],\n", " ['catalog', 'session', 'compliance', 'security'],\n", " ['decade', 'iops', 'ebs'],\n", " ['things', 'compatibility', 'mongodb'],\n", " ['years', 'success'],\n", " ['art', 'project']]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_keywords_tf_idf" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "clusters_df['categories'] = clusters_titles['cluster_label'].map(lambda i: clusters_keywords_tf_idf[i])" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster_labelcategories
4822 new or updated open datasets on aws new pol...0[blockchain, datasets, data]
533 gain insights from complex data featuring 3m0[blockchain, datasets, data]
634 steps to train and deploy machine learning m...0[blockchain, datasets, data]
10270 datasets inspire winning ideas to tackle op...0[blockchain, datasets, data]
105890 by capgemini with aws powering business de...0[blockchain, datasets, data]
............
22916what s around the turn in 2021 aws deepracer l...0[blockchain, datasets, data]
22972why our customers love amazon machine learning...0[blockchain, datasets, data]
22989why use docker containers for machine learning...0[blockchain, datasets, data]
22994will spark power the data behind precision med...0[blockchain, datasets, data]
23050yewno uses aws and ml to analyze vast amounts ...0[blockchain, datasets, data]
\n", "

1060 rows × 3 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster_label \\\n", "48 22 new or updated open datasets on aws new pol... 0 \n", "53 3 gain insights from complex data featuring 3m 0 \n", "63 4 steps to train and deploy machine learning m... 0 \n", "102 70 datasets inspire winning ideas to tackle op... 0 \n", "105 890 by capgemini with aws powering business de... 0 \n", "... ... ... \n", "22916 what s around the turn in 2021 aws deepracer l... 0 \n", "22972 why our customers love amazon machine learning... 0 \n", "22989 why use docker containers for machine learning... 0 \n", "22994 will spark power the data behind precision med... 0 \n", "23050 yewno uses aws and ml to analyze vast amounts ... 0 \n", "\n", " categories \n", "48 [blockchain, datasets, data] \n", "53 [blockchain, datasets, data] \n", "63 [blockchain, datasets, data] \n", "102 [blockchain, datasets, data] \n", "105 [blockchain, datasets, data] \n", "... ... \n", "22916 [blockchain, datasets, data] \n", "22972 [blockchain, datasets, data] \n", "22989 [blockchain, datasets, data] \n", "22994 [blockchain, datasets, data] \n", "23050 [blockchain, datasets, data] \n", "\n", "[1060 rows x 3 columns]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_df.loc[clusters_df['cluster_label']==0, ['blog_title_lemmatized', 'cluster_label', 'categories']]" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "clusters_categories_file = 'aws_blog_titles_clusters_categories.csv'\n", "clusters_df.to_csv(clusters_categories_file, index=False)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading clusters to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/results/clusters\n", "Uploaded clusters to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/results/clusters\n" ] } ], "source": [ "clusters_data_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"clusters\")\n", "print(f\"Uploading clusters to {clusters_data_path}\")\n", "clusters_file_uri = S3Uploader.upload(clusters_categories_file, clusters_data_path)\n", "print(f\"Uploaded clusters to {clusters_data_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the clusters" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " after removing the cwd from sys.path.\n" ] } ], "source": [ "cluster_sample_df = clusters_df.sample(n=1000).reset_index()\n", "title_embeddings_sample = cluster_sample_df.iloc[:,:-3]\n", "clusters_titles_sample = cluster_sample_df[['blog_title_lemmatized', 'cluster_label', 'categories']]\n", "clusters_titles_sample['short_categories'] = clusters_titles_sample['categories'].map(lambda x: x[:2])" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster_labelcategoriesshort_categories
0new aws lambda scaling controls for kinesis an...5[turbines, drone, wind, inspection, ai, driven][turbines, drone]
1high growth innovation powered by technology9[patient, health, part, building, digital][patient, health]
2aws java dao integration project18[years, success][years, success]
3aws online tech talks november 201718[years, success][years, success]
4icymi new stories updates and resources from a...2[starts, practices, help, customers][starts, practices]
...............
995announcing self service blacklisted address re...8[backup, ways, plans, rules][backup, ways]
996improving daemon services in amazon ecs16[decade, iops, ebs][decade, iops]
997improve your website availability with amazon ...4[success, years][success, years]
998how to package cookbook dependencies locally w...16[decade, iops, ebs][decade, iops]
999now hiring aws solutions architects18[years, success][years, success]
\n", "

1000 rows × 4 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster_label \\\n", "0 new aws lambda scaling controls for kinesis an... 5 \n", "1 high growth innovation powered by technology 9 \n", "2 aws java dao integration project 18 \n", "3 aws online tech talks november 2017 18 \n", "4 icymi new stories updates and resources from a... 2 \n", ".. ... ... \n", "995 announcing self service blacklisted address re... 8 \n", "996 improving daemon services in amazon ecs 16 \n", "997 improve your website availability with amazon ... 4 \n", "998 how to package cookbook dependencies locally w... 16 \n", "999 now hiring aws solutions architects 18 \n", "\n", " categories short_categories \n", "0 [turbines, drone, wind, inspection, ai, driven] [turbines, drone] \n", "1 [patient, health, part, building, digital] [patient, health] \n", "2 [years, success] [years, success] \n", "3 [years, success] [years, success] \n", "4 [starts, practices, help, customers] [starts, practices] \n", ".. ... ... \n", "995 [backup, ways, plans, rules] [backup, ways] \n", "996 [decade, iops, ebs] [decade, iops] \n", "997 [success, years] [success, years] \n", "998 [decade, iops, ebs] [decade, iops] \n", "999 [years, success] [years, success] \n", "\n", "[1000 rows x 4 columns]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_titles_sample" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "clusters_tsne = TSNE(perplexity=13, n_components=2, init='pca', n_iter=5000)\n", "tsne_embeddings = clusters_tsne.fit_transform(title_embeddings_sample)\n", "tsne_embeddings_df = pd.DataFrame(tsne_embeddings, columns=['x', 'y'])\n", "tsne_embeddings_df['cluster'] = clusters_titles_sample['cluster_label']\n", "tsne_embeddings_df['labels'] = clusters_titles_sample['categories']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "tsne_embeddings_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colors=[\n", " '#efaf50',\n", " '#a09934',\n", " '#e31ad9',\n", " '#cfcbb0',\n", " '#1224c9',\n", " '#669fa4',\n", " '#087274',\n", " '#787168',\n", " '#3e93cb',\n", " '#722823',\n", " '#c8784c',\n", " '#74ac48',\n", " '#c31033',\n", " '#5acc21',\n", " '#2ef8ba',\n", " '#c67ebe',\n", " '#805004',\n", " '#a8f43b',\n", " '#442d6d',\n", " '#9141ea',\n", "]\n", "\n", "fig, ax = plt.subplots(figsize=(30,30))\n", "ax = sns.scatterplot(data=tsne_embeddings_df, x='x', y='y', hue='cluster', legend='full', palette=colors, ax=ax)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tsne_embeddings_df[['cluster', 'labels']].drop_duplicates('cluster').sort_values('cluster')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }