{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create a short text clustering system using AWS SageMaker jumpstart pre-trained transformer models " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Introduction](#Introduction) \n", "2. [Setup](#Setup)\n", "3. [Create a model for text embeddings from the Jumpstart solutions library of models](#Create-a-model-for-text-embeddings-from-the-Jumpstart-solutions-library-of-models)\n", "4. [Data pre-processing](#Data-pre-processing)\n", "5. [Create phrase (sentence) embeddings](#Create-phrase-(sentence)-embeddings)\n", "6. [Cluster phrases (sentences)](#Cluster-phrases-(sentences))\n", "7. [Automatic cluster labeling](#Automatic-cluster-labeling)\n", "8. [Batch process the entire dataset](#Batch-process-the-entire-dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we demonstrate how you can cluster short text (phrases) using the pre-trained transformer models on [AWS SageMaker Jumpstart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). Here we will demonstrate the use of a transformer model called [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli). The model is used to create an embedding of phrases that we will then use to cluster such phrases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by updating the required packages i.e. SageMaker Python SDK, pandas, numpy, etc." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Keyring is skipped due to an exception: 'keyring.backends'\n", "Requirement already satisfied: boto3 in /opt/conda/lib/python3.7/site-packages (1.26.24)\n", "Requirement already satisfied: jsonlines in /opt/conda/lib/python3.7/site-packages (3.1.0)\n", "Requirement already satisfied: seaborn in /opt/conda/lib/python3.7/site-packages (0.10.0)\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from boto3) (0.6.0)\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3) (1.0.1)\n", "Requirement already satisfied: botocore<1.30.0,>=1.29.24 in /opt/conda/lib/python3.7/site-packages (from boto3) (1.29.24)\n", "Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.7/site-packages (from jsonlines) (4.4.0)\n", "Requirement already satisfied: attrs>=19.2.0 in /opt/conda/lib/python3.7/site-packages (from jsonlines) (22.1.0)\n", "Requirement already satisfied: pandas>=0.22.0 in /opt/conda/lib/python3.7/site-packages (from seaborn) (1.3.5)\n", "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from seaborn) (1.21.6)\n", "Requirement already satisfied: scipy>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from seaborn) (1.4.1)\n", "Requirement already satisfied: matplotlib>=2.1.2 in /opt/conda/lib/python3.7/site-packages (from seaborn) (3.1.3)\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3) (1.26.13)\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3) (2.8.2)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in 
/opt/conda/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (1.1.0)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (2.4.6)\n", "Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib>=2.1.2->seaborn) (0.10.0)\n", "Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.22.0->seaborn) (2019.3)\n", "Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from cycler>=0.10->matplotlib>=2.1.2->seaborn) (1.14.0)\n", "Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.1.2->seaborn) (59.3.0)\n", "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n", "\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip available: \u001b[0m\u001b[31;49m22.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n", "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" ] } ], "source": [ "!pip install boto3 jsonlines seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Note: Restart the notebook's kernel after installing the above packages.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "import json\n", "import re\n", "import os\n", "\n", "from sagemaker import get_execution_role\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import math\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer \n", "from sklearn.feature_extraction.text import CountVectorizer \n", "\n", "from sklearn.manifold import TSNE\n", "from sklearn.cluster import SpectralClustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use NLTK library to help us with the pre-processing of the data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "session = boto3.Session()\n", "sagemaker_execution_role = get_execution_role()\n", "s3 = session.resource('s3')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a model for text embeddings from the Jumpstart solutions library of models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use one the text embedding [models available in SageMaker jumpstart](https://sagemaker.readthedocs.io/en/v2.129.0/doc_utils/pretrainedmodels.html)" ] }, { "cell_type": "markdown", 
"metadata": { "tags": [] }, "source": [ "#### Chose a model for Inference" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "#We choose the tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1 as the default model since it is better suited for phrase analysis\n", "\n", "model_id, model_version = (\n", " \"tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1\", \n", " \"*\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "You can continue with the default model, or can choose a different model from the dropdown generated upon running the next cell. A complete list of SageMaker pre-trained models can also be accessed at [Sagemaker pre-trained Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [], "source": [ "from ipywidgets import Dropdown\n", "from sagemaker.jumpstart.notebook_utils import list_jumpstart_models, list_jumpstart_tasks\n", "\n", "# Retrieves all text embedding models.\n", "filter_value = \"task == tcembedding\"\n", "tcembedding_models = list_jumpstart_models(filter=filter_value)\n", "\n", "# display the model-ids in a dropdown to select a model for inference.\n", "model_dropdown = Dropdown(\n", " options=tcembedding_models,\n", " value=model_id,\n", " description=\"Select a model\",\n", " style={\"description_width\": \"initial\"},\n", " layout={\"width\": \"max-content\"},\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ae82576e5f51430da1a128adba6a276d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Select a model', index=33, layout=Layout(width='max-content'), options=('mxnet-tcembeddi…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(model_dropdown)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# model_version=\"*\" fetches the latest version of the model\n", "model_id, model_version = model_dropdown.value, \"*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"jumpstart-example-infer-{model_id}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the model from the selected model_id, model_version" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris, hyperparameters\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "inference_instance_type = \"ml.m5.xlarge\" #You can change the instance according to your needs\n", "\n", "# Retrieve the inference docker container uri. This is the base Tensorflow container image for the default model above.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None, # automatically inferred from model_id\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=inference_instance_type,\n", ")\n", "\n", "# Retrieve the inference script uri. 
This includes all dependencies and scripts for model loading, inference handling etc.\n", "deploy_source_uri = script_uris.retrieve(\n", " model_id=model_id, model_version=model_version, script_scope=\"inference\"\n", ")\n", "\n", "\n", "# Retrieve the model uri. This includes the model and model parameters.\n", "model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"inference\"\n", ")\n", "\n", "\n", "# Create the SageMaker model instance\n", "embedding_model = Model(\n", " image_uri=deploy_image_uri,\n", " source_dir=deploy_source_uri,\n", " model_data=model_uri,\n", " entry_point=\"inference.py\", # entry point file in source_dir and present in deploy_source_uri\n", " role=sagemaker_execution_role,\n", " predictor_cls=Predictor,\n", " name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pre-processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this demonstration we will use a dataset made up of the blog titles for each blog published by AWS from 2004 until late 2022. We use the blog titles to cluster them and assign a topic to each cluster\n", "\n", "The text is pre-processed with the following steps:\n", "\n", "* Set category string to lowercase\n", "* Replace acronyms with actual words \n", "* Replace special word-bound characters such as / and - (i.e.: imagenes/videos, cerveza-vino) to get separate words.\n", "* Eliminate explanations between parenthesis\n", "* Remove any other non-word characters from sentence\n", "* Split sentence into tokens\n", "* Singularize each token" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "blogs_df = pd.read_csv('aws_blog_titles.csv', header=None, names=['URL', 'Title'])\n", "\n", "aws_acronyms_df = pd.read_csv('acronyms.csv', header=None, delimiter=';', names=['acronym', 'meaning'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
URLTitle
0https://aws.amazon.com/blogs/aws/10000_sheep_col/10 000 Sheep Collaborative Art Project
1https://aws.amazon.com/blogs/apn/10-best-pract...10 Best Practices to Help Partners Build AWS Q...
2https://aws.amazon.com/blogs/architecture/ten-...10 Things Serverless Architects Should Know
3https://aws.amazon.com/blogs/apn/10-years-of-s...10 Years of Success AWS and Alert Logic
4https://aws.amazon.com/blogs/apn/10-years-of-s...10 Years of Success AWS and Appian
.........
23195https://aws.amazon.com/blogs/media/reinvent-bo...re Invent bonus content M E focused sessions o...
23196https://aws.amazon.com/blogs/opensource/reinve...re Invent open source highlights Week 1
23197https://aws.amazon.com/blogs/startups/redbus-b...redBus Building a Data Platform with AWS Apach...
23198https://aws.amazon.com/blogs/security/s2n-and-...s2n and Lucky 13
23199https://aws.amazon.com/blogs/startups/wefoxgro...wefoxgroup s Migration to AWS and Amazon EKS
\n", "

23200 rows × 2 columns

\n", "
" ], "text/plain": [ " URL \\\n", "0 https://aws.amazon.com/blogs/aws/10000_sheep_col/ \n", "1 https://aws.amazon.com/blogs/apn/10-best-pract... \n", "2 https://aws.amazon.com/blogs/architecture/ten-... \n", "3 https://aws.amazon.com/blogs/apn/10-years-of-s... \n", "4 https://aws.amazon.com/blogs/apn/10-years-of-s... \n", "... ... \n", "23195 https://aws.amazon.com/blogs/media/reinvent-bo... \n", "23196 https://aws.amazon.com/blogs/opensource/reinve... \n", "23197 https://aws.amazon.com/blogs/startups/redbus-b... \n", "23198 https://aws.amazon.com/blogs/security/s2n-and-... \n", "23199 https://aws.amazon.com/blogs/startups/wefoxgro... \n", "\n", " Title \n", "0 10 000 Sheep Collaborative Art Project \n", "1 10 Best Practices to Help Partners Build AWS Q... \n", "2 10 Things Serverless Architects Should Know \n", "3 10 Years of Success AWS and Alert Logic \n", "4 10 Years of Success AWS and Appian \n", "... ... \n", "23195 re Invent bonus content M E focused sessions o... \n", "23196 re Invent open source highlights Week 1 \n", "23197 redBus Building a Data Platform with AWS Apach... \n", "23198 s2n and Lucky 13 \n", "23199 wefoxgroup s Migration to AWS and Amazon EKS \n", "\n", "[23200 rows x 2 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blogs_df" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
acronymmeaning
0Amazon ESAmazon Elasticsearch Service
1AMIAmazon Machine Image
2APIApplication Programming Interface
3AIArtificial Intelligence
4ACLAccess Control List
.........
120VPNVirtual Private Network
121VLANVirtual Local Area Network
122VDIVirtual Desktop Infrastructure
123VPGVirtual Private Gateway
124WAFWeb Application Firewall
\n", "

125 rows × 2 columns

\n", "
" ], "text/plain": [ " acronym meaning\n", "0 Amazon ES Amazon Elasticsearch Service\n", "1 AMI Amazon Machine Image\n", "2 API Application Programming Interface\n", "3 AI Artificial Intelligence\n", "4 ACL Access Control List\n", ".. ... ...\n", "120 VPN Virtual Private Network\n", "121 VLAN Virtual Local Area Network\n", "122 VDI Virtual Desktop Infrastructure\n", "123 VPG Virtual Private Gateway\n", "124 WAF Web Application Firewall\n", "\n", "[125 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We transform acronyms to their actual meaning since the transformer may not be aware of them (as it was not trained in this specific vocabulary)\n", "\n", "aws_acronyms_df" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "blogs_df = blogs_df.drop_duplicates(subset=['Title']).reset_index()\n", "blogs_df = blogs_df.drop(columns=['index', 'URL'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "#For eficiency only take sample_size titles at random\n", "\n", "sample_size = 1000\n", "blogs_df_sample = blogs_df.sample(n=sample_size)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "titles = blogs_df_sample['Title'].tolist()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "lemmatized = []\n", "for title in titles:\n", " sentence = title.lower()\n", " sentence = re.sub(r'[^a-zA-Z0-9_-áéíóúñ ]', r'', sentence) #remove extraneous characters (maybe a different encoding)\n", " lemmatized.append(sentence)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['nova brings data science courses to the usmc with aws educate',\n", " 'cloud native application monitoring for aws',\n", " 'sharing matlab applications on aws using the matlab web app server',\n", " 'amazon elastic file system shared file storage for amazon ec2',\n", " 'duplicating infrastructure on aws',\n", " 'amazon kinesis analytics process streaming data in real time with sql',\n", " 'how to enable 360 degree analytics and innovate faster on aws with datavard glue for sap',\n", " 'amazon monitron a simple and cost effective service enabling predictive maintenance',\n", " 'new aws partner program launches and updates announced at re invent 2020',\n", " 'online tech talk april 23 persistent storage for containers with amazon efs']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lemmatized[:10]" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Create phrase (sentence) embeddings" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# These functions are used to query the endpoint and parse the response\n", "\n", "def query(model_predictor, text):\n", " \"\"\"Query the model predictor.\"\"\"\n", "\n", " encoded_text = json.dumps(text).encode(\"utf-8\")\n", "\n", " query_response = model_predictor.predict(\n", " encoded_text,\n", " {\n", " \"ContentType\": \"application/x-text\",\n", " \"Accept\": \"application/json\",\n", " },\n", " )\n", " return query_response\n", "\n", "\n", "def parse_response(query_response):\n", " \"\"\"Parse response and return the embedding.\"\"\"\n", "\n", " model_predictions = json.loads(query_response)\n", " embedding = model_predictions[\"embedding\"]\n", " return embedding" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Deploy the selected model to an endpoint for real time inference" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-----!" ] } ], "source": [ "# deploy the Model. Note that we need to pass Predictor class when we deploy model through Model class,\n", "# for being able to run inference through the sagemaker API.\n", "model_predictor = embedding_model.deploy(\n", " initial_instance_count=1,\n", " instance_type=inference_instance_type,\n", " predictor_cls=Predictor,\n", " endpoint_name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate embeddings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the deployed model to generate the embeddings for each of the titles in our sample dataset" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "#model_predictor = Predictor('jumpstart-example-infer-tensorflow-tcem-2023-01-19-23-23-44-619') #Specifiy endpoint name in case you wanna use an already deployed endpoint" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.37 s, sys: 142 ms, total: 2.52 s\n", "Wall time: 4min 59s\n" ] } ], "source": [ "%%time\n", "sentence_vectors = [parse_response(query(model_predictor, title)) for title in lemmatized]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "encoded_titles_df = pd.DataFrame(sentence_vectors)\n", "encoded_titles_df['blog_title_lemmatized'] = lemmatized" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...101510161017101810191020102110221023blog_title_lemmatized
0-1.2592440.1021270.089923-0.820289-0.223537-0.4328970.799307-0.674757-0.252520-0.272772...0.141702-0.3929420.119546-0.386988-0.5684770.205832-0.170402-1.0257660.198024nova brings data science courses to the usmc w...
1-0.6929620.259012-0.748741-0.866656-0.901526-0.9977680.377633-0.053566-0.5037520.144736...0.1141720.5220530.834067-0.771977-0.4666120.195516-0.7157550.326673-0.530868cloud native application monitoring for aws
2-0.6527420.1266230.094145-0.545012-0.761059-0.479529-0.1198980.001302-0.3473510.485530...0.158906-0.3141140.502422-0.112350-0.143338-0.790585-0.104163-0.003831-0.414520sharing matlab applications on aws using the m...
30.157903-1.218128-0.358249-0.143645-0.7859350.1116890.451628-0.0401830.2325290.813196...0.574929-0.5959770.7333600.2285050.4845460.407535-0.6560690.4651970.631824amazon elastic file system shared file storage...
4-0.1997680.498280-0.096403-1.235060-0.164219-0.747414-0.2802070.717804-1.2013740.192102...-0.4228360.0487280.729825-0.9742790.4520090.646301-0.7311220.526780-0.412448duplicating infrastructure on aws
..................................................................
995-0.1905240.1905220.247199-0.909789-0.532487-0.958266-0.366771-0.469838-0.685504-0.749760...-0.4820150.3583460.935664-0.9442320.3747910.1388230.717717-0.5817090.456586aws week in review august 25 2014
996-0.524172-0.5085150.474869-1.093429-1.255689-0.6884240.8987230.750922-1.1547381.334882...0.765626-0.2403150.339106-0.1428250.226154-0.350119-0.670260-0.665765-0.013468optimizing performance for users in china with...
997-0.4047310.581385-0.491551-0.588944-0.035981-0.5298260.4599390.522350-0.7747180.081250...0.718470-0.0133231.003814-0.764670-0.3642410.231567-0.153268-0.106582-1.071888aws accelerator for citrix migrate or deploy x...
9980.3075070.6467970.672350-0.879446-0.829220-0.1739030.130434-0.2285250.172912-0.866439...-0.857214-0.3249041.445891-0.198214-0.1046840.646890-0.060024-0.3841430.098519in case you missed it september 2019 top blog ...
9990.926491-0.4433660.707322-0.335114-1.174717-0.1235560.9198390.6483310.875201-0.280192...0.192803-0.1631930.983904-0.5195470.6765460.539295-0.2934650.3788260.312693why avatars are usually awful and how snappr f...
\n", "

1000 rows × 1025 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 -1.259244 0.102127 0.089923 -0.820289 -0.223537 -0.432897 0.799307 \n", "1 -0.692962 0.259012 -0.748741 -0.866656 -0.901526 -0.997768 0.377633 \n", "2 -0.652742 0.126623 0.094145 -0.545012 -0.761059 -0.479529 -0.119898 \n", "3 0.157903 -1.218128 -0.358249 -0.143645 -0.785935 0.111689 0.451628 \n", "4 -0.199768 0.498280 -0.096403 -1.235060 -0.164219 -0.747414 -0.280207 \n", ".. ... ... ... ... ... ... ... \n", "995 -0.190524 0.190522 0.247199 -0.909789 -0.532487 -0.958266 -0.366771 \n", "996 -0.524172 -0.508515 0.474869 -1.093429 -1.255689 -0.688424 0.898723 \n", "997 -0.404731 0.581385 -0.491551 -0.588944 -0.035981 -0.529826 0.459939 \n", "998 0.307507 0.646797 0.672350 -0.879446 -0.829220 -0.173903 0.130434 \n", "999 0.926491 -0.443366 0.707322 -0.335114 -1.174717 -0.123556 0.919839 \n", "\n", " 7 8 9 ... 1015 1016 1017 \\\n", "0 -0.674757 -0.252520 -0.272772 ... 0.141702 -0.392942 0.119546 \n", "1 -0.053566 -0.503752 0.144736 ... 0.114172 0.522053 0.834067 \n", "2 0.001302 -0.347351 0.485530 ... 0.158906 -0.314114 0.502422 \n", "3 -0.040183 0.232529 0.813196 ... 0.574929 -0.595977 0.733360 \n", "4 0.717804 -1.201374 0.192102 ... -0.422836 0.048728 0.729825 \n", ".. ... ... ... ... ... ... ... \n", "995 -0.469838 -0.685504 -0.749760 ... -0.482015 0.358346 0.935664 \n", "996 0.750922 -1.154738 1.334882 ... 0.765626 -0.240315 0.339106 \n", "997 0.522350 -0.774718 0.081250 ... 0.718470 -0.013323 1.003814 \n", "998 -0.228525 0.172912 -0.866439 ... -0.857214 -0.324904 1.445891 \n", "999 0.648331 0.875201 -0.280192 ... 0.192803 -0.163193 0.983904 \n", "\n", " 1018 1019 1020 1021 1022 1023 \\\n", "0 -0.386988 -0.568477 0.205832 -0.170402 -1.025766 0.198024 \n", "1 -0.771977 -0.466612 0.195516 -0.715755 0.326673 -0.530868 \n", "2 -0.112350 -0.143338 -0.790585 -0.104163 -0.003831 -0.414520 \n", "3 0.228505 0.484546 0.407535 -0.656069 0.465197 0.631824 \n", "4 -0.974279 0.452009 0.646301 -0.731122 0.526780 -0.412448 \n", ".. ... ... ... ... ... ... \n", "995 -0.944232 0.374791 0.138823 0.717717 -0.581709 0.456586 \n", "996 -0.142825 0.226154 -0.350119 -0.670260 -0.665765 -0.013468 \n", "997 -0.764670 -0.364241 0.231567 -0.153268 -0.106582 -1.071888 \n", "998 -0.198214 -0.104684 0.646890 -0.060024 -0.384143 0.098519 \n", "999 -0.519547 0.676546 0.539295 -0.293465 0.378826 0.312693 \n", "\n", " blog_title_lemmatized \n", "0 nova brings data science courses to the usmc w... \n", "1 cloud native application monitoring for aws \n", "2 sharing matlab applications on aws using the m... \n", "3 amazon elastic file system shared file storage... \n", "4 duplicating infrastructure on aws \n", ".. ... \n", "995 aws week in review august 25 2014 \n", "996 optimizing performance for users in china with... \n", "997 aws accelerator for citrix migrate or deploy x... \n", "998 in case you missed it september 2019 top blog ... \n", "999 why avatars are usually awful and how snappr f... \n", "\n", "[1000 rows x 1025 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded_titles_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster phrases (sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spectral clsutering is a clustering algorithm based on graph theory. Spectral clustering uses information from the eigenvalues (spectrum) of the Laplacian matrix built from the graph or the data set to create groups (clusters) of data. 
Spectral clustering requires a measure of affinity between data points; for this application we use cosine affinity because we are interested in sentences that lie near each other and therefore have \"similar meaning\"." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "n_clusters = 20\n", "\n", "clustering_model = SpectralClustering(n_clusters=n_clusters, n_init=100, affinity='cosine', n_neighbors=10, assign_labels=\"kmeans\", random_state=0)\n", "embeddings = encoded_titles_df[encoded_titles_df.columns[0:-1]]\n", "encoded_titles_df['cluster'] = clustering_model.fit_predict(embeddings)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "clusters_titles = encoded_titles_df[['blog_title_lemmatized', 'cluster']]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster
0nova brings data science courses to the usmc w...9
1cloud native application monitoring for aws17
2sharing matlab applications on aws using the m...0
3amazon elastic file system shared file storage...5
4duplicating infrastructure on aws17
.........
995aws week in review august 25 20146
996optimizing performance for users in china with...8
997aws accelerator for citrix migrate or deploy x...17
998in case you missed it september 2019 top blog ...6
999why avatars are usually awful and how snappr f...9
\n", "

1000 rows × 2 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster\n", "0 nova brings data science courses to the usmc w... 9\n", "1 cloud native application monitoring for aws 17\n", "2 sharing matlab applications on aws using the m... 0\n", "3 amazon elastic file system shared file storage... 5\n", "4 duplicating infrastructure on aws 17\n", ".. ... ...\n", "995 aws week in review august 25 2014 6\n", "996 optimizing performance for users in china with... 8\n", "997 aws accelerator for citrix migrate or deploy x... 17\n", "998 in case you missed it september 2019 top blog ... 6\n", "999 why avatars are usually awful and how snappr f... 9\n", "\n", "[1000 rows x 2 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_titles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatic cluster labeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use TF-IDF for finding the keywords in each of our clusters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text Frequency - Inverse Document Frequency is an NLP technique used to find the most relevant terms in set of documents (phrases in our case). From each cluster we extract its most relevant terms (nouns only) according to TF-IDF and use those as labels/categories for that cluster" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "clusters = [clusters_titles.loc[clusters_titles.cluster == i, 'blog_title_lemmatized'].to_list() for i in range(0,n_clusters)]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "clusters_tf_idf = []\n", "clusters_tf_idf_terms = []\n", "clusters_tags = []\n", "clusters_keywords_tf_idf = []\n", "tf_idf_threshold = 0.2\n", "\n", "for cluster in clusters:\n", "\n", " tfIdfVectorizer = TfidfVectorizer(use_idf=True)\n", " tfIdf = tfIdfVectorizer.fit_transform(cluster)\n", " tf_idf_df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=[\"TF-IDF\"])\n", " tf_idf_df = tf_idf_df.sort_values('TF-IDF', ascending=False)\n", " \n", " clusters_tf_idf.append(tf_idf_df)\n", " \n", " cluster_tf_idf_terms = list(tf_idf_df.loc[tf_idf_df['TF-IDF'] > tf_idf_threshold].index.values)\n", " clusters_tf_idf_terms.append(cluster_tf_idf_terms)\n", " \n", " tags = nltk.pos_tag(cluster_tf_idf_terms)\n", " clusters_tags.append(tags)\n", " \n", " keywords = [tag[0] for tag in tags if tag[1] in ['NN', 'NNS'] and tag[0] not in ['aws', 'amazon']]\n", " clusters_keywords_tf_idf.append(keywords)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "clusters_idf = []\n", "\n", "for cluster in clusters:\n", "\n", " cv = CountVectorizer() \n", " word_count_vector = cv.fit_transform(cluster)\n", "\n", " tfidf_transformer = TfidfTransformer(smooth_idf=True,use_idf=True) \n", " tfidf_transformer.fit(word_count_vector)\n", "\n", " df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=[\"idf_weights\"])\n", " df_idf['word'] = cv.get_feature_names()\n", "\n", " df_idf = df_idf.sort_values(by='idf_weights')\n", " clusters_idf.append(df_idf)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['matlab', 'applications', 'app'],\n", " ['gen', 'adopters', 'problems'],\n", " ['model', 'register', 'security'],\n", " ['tool', 'risks', 'manage', 'systems'],\n", " ['messaging'],\n", " ['file', 'storage', 'system'],\n", " 
['september', 'week', 'review'],\n", " ['connectivity', 'options', 'gigabit'],\n", " ['quality', 'call', 'connect', 'detection', 'time', 'opensearch'],\n", " ['educate', 'nova', 'brings', 'courses', 'science', 'data'],\n", " ['service', 'monitron', 'simple', 'maintenance', 'cost'],\n", " ['service', 'database'],\n", " ['uploads', 's3'],\n", " ['process', 'streaming', 'time', 'analytics', 'kinesis'],\n", " ['reports', 'soc'],\n", " ['sydney', 'university', 'technology', 'stroke', 'rehabilitation', 'robots'],\n", " ['jenkins', 'party', 'create', 'source', 'projects', 'control'],\n", " ['application', 'monitoring', 'cloud'],\n", " ['efs', 'talk', 'tech', 'online', 'containers', 'storage'],\n", " ['program', 'updates', 'partner']]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_keywords_tf_idf" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idf_weightsword
aws1.117783aws
the1.492476the
for1.750306for
sdk1.944462sdk
with1.944462with
and2.098612and
to2.386294to
using2.504077using
android2.791759android
on2.791759on
in2.791759in
an2.791759an
console2.974081console
app2.974081app
application2.974081application
amazon2.974081amazon
new2.974081new
open3.197225open
release3.197225release
build3.197225build
\n", "
" ], "text/plain": [ " idf_weights word\n", "aws 1.117783 aws\n", "the 1.492476 the\n", "for 1.750306 for\n", "sdk 1.944462 sdk\n", "with 1.944462 with\n", "and 2.098612 and\n", "to 2.386294 to\n", "using 2.504077 using\n", "android 2.791759 android\n", "on 2.791759 on\n", "in 2.791759 in\n", "an 2.791759 an\n", "console 2.974081 console\n", "app 2.974081 app\n", "application 2.974081 application\n", "amazon 2.974081 amazon\n", "new 2.974081 new\n", "open 3.197225 open\n", "release 3.197225 release\n", "build 3.197225 build" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_idf[0][:20]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "clusters_keywords_idf = []\n", "stop_words = stopwords.words('english')\n", "\n", "clusters_idf_words = [cluster['word'].to_list()[0:10] for cluster in clusters_idf]\n", "s = [ word for word in stop_words if word != 're'] #Remove stopwords but the word re (for re: invent)\n", "\n", "for cluster in clusters_idf_words:\n", " tags = nltk.pos_tag(cluster)\n", " words = [ tag[0] for tag in tags if tag[1] in ['NN', 'NNS'] and tag[0] not in ['aws', 'amazon']]\n", " \n", " clusters_keywords_idf.append(\",\".join(words))" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sdk,android',\n", " '',\n", " 'security,summit',\n", " 'manager,systems,management',\n", " '',\n", " 'instances,instance',\n", " 'review,week,september,part',\n", " 'partners',\n", " 'service,rekognition',\n", " 'data',\n", " 'cloud,data,management',\n", " 'rds,database',\n", " 'cloudformation,update',\n", " 'data,redshift,dynamodb,s3',\n", " 'source,time',\n", " '',\n", " 'sagemaker,model,inference,data',\n", " 'workloads',\n", " '',\n", " 're,invent,guide']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_keywords_idf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch process the entire dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will create batch processing jobs to process the entire dataset (roughly 24K titles)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "import io\n", "import jsonlines\n", "\n", "from sagemaker.s3 import S3Downloader,S3Uploader,s3_path_join\n", "\n", "n_clusters = 20" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "bucket_name = 'unsupervised-phrase-clustering-with-sagemaker' #\n", "s3_prefix = 'text-clustering-with-transformers/data'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Chose a model for Inference" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "#We choose the tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1 as the default model since it is better suited for phrase analysis\n", "\n", "model_id, model_version = (\n", " \"tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1\", \n", " \"*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can continue with the default model, or can choose a different model from the dropdown generated upon running the next cell. A complete list of SageMaker pre-trained models can also be accessed at [Sagemaker pre-trained Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the model from the selected model_id, model_version" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"jumpstart-example-infer-gpu-{model_id}\")" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "batch_transform_instance_type = \"ml.g4dn.xlarge\"\n", "\n", "# Retrieve the inference docker container uri. This is the base Tensorflow container image for the default model above.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None, # automatically inferred from model_id\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=batch_transform_instance_type,\n", ")\n", "\n", "# Retrieve the inference script uri. This includes all dependencies and scripts for model loading, inference handling etc.\n", "deploy_source_uri = script_uris.retrieve(\n", " model_id=model_id, model_version=model_version, script_scope=\"inference\"\n", ")\n", "\n", "\n", "# Retrieve the model uri. This includes the model and model parameters.\n", "model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"inference\"\n", ")\n", "\n", "\n", "# Create the SageMaker model instance\n", "batch_transform_embedding_model = Model(\n", " image_uri=deploy_image_uri,\n", " source_dir=deploy_source_uri,\n", " model_data=model_uri,\n", " entry_point=\"inference.py\", # entry point file in source_dir and present in deploy_source_uri\n", " role=sagemaker_execution_role,\n", " name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data preprocessing" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "blogs_df = pd.read_csv('aws_blog_titles.csv', header=None, names=['URL', 'Title'])\n", "aws_acronyms_df = pd.read_csv('acronyms.csv', header=None, delimiter=';', names=['acronym', 'meaning'])\n", "blogs_df = blogs_df.drop_duplicates(subset=['Title']).reset_index()\n", "blogs_df = blogs_df.drop(columns=['index', 'URL'])\n", "titles = blogs_df['Title'].tolist()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "lemmatized = []\n", "for title in titles:\n", " sentence = title.lower()\n", " sentence = re.sub(r'[^a-zA-Z0-9_-áéíóúñ ]', r'', sentence) #remove extraneous characters (maybe a different encoding)\n", " lemmatized.append(sentence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload the pre-processed data to S3" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "batch_filename = 'aws_blog_titles.jsonl'" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "with open(batch_filename, \"wb\") as txt_file:\n", " for title in lemmatized:\n", " \n", " txt_file.write(json.dumps(title).encode(\"utf-8\"))\n", " txt_file.write(\"\\n\".encode('utf-8'))" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading data to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/raw\n", 
"Uploaded data to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/raw\n" ] } ], "source": [ "data_upload_path = s3_path_join(\"s3://\",bucket_name,s3_prefix, 'raw')\n", "print(f\"Uploading data to {data_upload_path}\")\n", "data_uri = S3Uploader.upload(batch_filename, data_upload_path)\n", "print(f\"Uploaded data to {data_upload_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate embeddings" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "# create transformer to run a batch job\n", "\n", "output_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"embeddings\")\n", "\n", "batch_job = batch_transform_embedding_model.transformer(\n", " instance_count=1,\n", " instance_type=batch_transform_instance_type,\n", " strategy='SingleRecord',\n", " assemble_with='Line',\n", " output_path=output_path,\n", ")" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# Starts batch transform job and uses S3 data as input. Enable the logs and wait only if you pass a small number of samples (< 100).\n", "# You can monitor your batch processing job from the SageMaker Console -> Inference -> Batch transform jobs\n", "batch_job.transform(\n", " data=data_upload_path,\n", " content_type='application/x-text', \n", " split_type='Line',\n", " logs=False,\n", " wait=False\n", ")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading embeddings to .\n", "Downloaded embeddings to .\n" ] } ], "source": [ "#Download the results. \n", "#The batch transformation job (step above) must have finished before you can run this cell.\n", "embedding_data_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"embeddings\", batch_filename+'.out')\n", "print(f\"Downloading embeddings to .\")\n", "S3Downloader.download(embedding_data_path,'.')\n", "print(f\"Downloaded embeddings to .\")" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "lines = []\n", "\n", "with jsonlines.open(batch_filename+\".out\", mode='r') as reader:\n", " for obj in reader:\n", " lines.append(obj['embedding'])\n", " \n", "results_df = pd.DataFrame(lines)\n", "results_df['blog_title_lemmatized'] = lemmatized" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...101510161017101810191020102110221023blog_title_lemmatized
00.188292-0.390401-0.286212-0.732304-0.841209-1.088366-0.5982131.0238980.821023-1.058815...-0.1759970.2458660.7061020.783085-0.8235471.258304-0.043126-0.472476-0.35132410 000 sheep collaborative art project
1-0.264925-0.1083070.688968-0.680849-0.351448-0.8992820.391024-0.6357010.246767-0.395383...0.6737140.1676501.4761310.409790-0.0500080.5825450.279141-0.205470-0.82035810 best practices to help partners build aws q...
20.3718850.7713791.069450-1.0083370.655574-0.955685-0.7113910.6254570.369845-0.443280...-0.096941-0.0258790.967100-0.2833450.6914510.8821240.534032-0.7393880.37805310 things serverless architects should know
3-0.3013700.8477130.486485-0.280148-1.502466-1.012401-0.0910170.317997-0.872763-0.809212...-0.3925430.0505540.026191-0.436454-0.3203720.507621-0.305170-0.284728-0.77644110 years of success aws and alert logic
40.0298710.2951200.633014-0.412160-1.094922-0.7949950.185795-0.116586-0.548522-1.379145...0.5486480.0109770.174071-0.338245-0.2988920.5998600.190257-0.457519-1.00318710 years of success aws and appian
..................................................................
23125-0.0229621.2567890.323558-0.1997240.1790750.2199970.139460-0.184126-0.289477-0.732770...-0.7509400.3332120.500258-0.642244-0.2510671.796083-0.214922-0.485310-0.257823re invent bonus content m e focused sessions o...
231260.7004330.9530941.036029-0.781870-0.111338-0.4784880.362508-0.4221370.382570-0.086834...-0.2690350.2573870.684636-0.135962-0.2679560.802172-0.122769-0.1623950.466416re invent open source highlights week 1
231270.0965530.9534120.493536-1.0500310.031639-0.762166-0.2219620.420519-0.328862-0.170843...-0.7291300.5186890.3604140.419243-0.2445980.524628-0.373507-0.994924-0.055463redbus building a data platform with aws apach...
23128-0.662509-0.0926240.064623-0.446625-0.261067-0.4373870.7017770.340038-0.293806-0.226297...0.8819360.4503750.2891840.1228120.0635640.710250-0.277890-1.0997300.753118s2n and lucky 13
23129-0.542878-0.2376170.501378-1.1801790.353943-0.340788-0.5389090.480047-0.763953-0.598589...-0.372877-0.1954860.8644770.186859-1.1272511.057047-0.2116750.377588-0.937336wefoxgroup s migration to aws and amazon eks
\n", "

23130 rows × 1025 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 0.188292 -0.390401 -0.286212 -0.732304 -0.841209 -1.088366 -0.598213 \n", "1 -0.264925 -0.108307 0.688968 -0.680849 -0.351448 -0.899282 0.391024 \n", "2 0.371885 0.771379 1.069450 -1.008337 0.655574 -0.955685 -0.711391 \n", "3 -0.301370 0.847713 0.486485 -0.280148 -1.502466 -1.012401 -0.091017 \n", "4 0.029871 0.295120 0.633014 -0.412160 -1.094922 -0.794995 0.185795 \n", "... ... ... ... ... ... ... ... \n", "23125 -0.022962 1.256789 0.323558 -0.199724 0.179075 0.219997 0.139460 \n", "23126 0.700433 0.953094 1.036029 -0.781870 -0.111338 -0.478488 0.362508 \n", "23127 0.096553 0.953412 0.493536 -1.050031 0.031639 -0.762166 -0.221962 \n", "23128 -0.662509 -0.092624 0.064623 -0.446625 -0.261067 -0.437387 0.701777 \n", "23129 -0.542878 -0.237617 0.501378 -1.180179 0.353943 -0.340788 -0.538909 \n", "\n", " 7 8 9 ... 1015 1016 1017 \\\n", "0 1.023898 0.821023 -1.058815 ... -0.175997 0.245866 0.706102 \n", "1 -0.635701 0.246767 -0.395383 ... 0.673714 0.167650 1.476131 \n", "2 0.625457 0.369845 -0.443280 ... -0.096941 -0.025879 0.967100 \n", "3 0.317997 -0.872763 -0.809212 ... -0.392543 0.050554 0.026191 \n", "4 -0.116586 -0.548522 -1.379145 ... 0.548648 0.010977 0.174071 \n", "... ... ... ... ... ... ... ... \n", "23125 -0.184126 -0.289477 -0.732770 ... -0.750940 0.333212 0.500258 \n", "23126 -0.422137 0.382570 -0.086834 ... -0.269035 0.257387 0.684636 \n", "23127 0.420519 -0.328862 -0.170843 ... -0.729130 0.518689 0.360414 \n", "23128 0.340038 -0.293806 -0.226297 ... 0.881936 0.450375 0.289184 \n", "23129 0.480047 -0.763953 -0.598589 ... -0.372877 -0.195486 0.864477 \n", "\n", " 1018 1019 1020 1021 1022 1023 \\\n", "0 0.783085 -0.823547 1.258304 -0.043126 -0.472476 -0.351324 \n", "1 0.409790 -0.050008 0.582545 0.279141 -0.205470 -0.820358 \n", "2 -0.283345 0.691451 0.882124 0.534032 -0.739388 0.378053 \n", "3 -0.436454 -0.320372 0.507621 -0.305170 -0.284728 -0.776441 \n", "4 -0.338245 -0.298892 0.599860 0.190257 -0.457519 -1.003187 \n", "... ... ... ... ... ... ... \n", "23125 -0.642244 -0.251067 1.796083 -0.214922 -0.485310 -0.257823 \n", "23126 -0.135962 -0.267956 0.802172 -0.122769 -0.162395 0.466416 \n", "23127 0.419243 -0.244598 0.524628 -0.373507 -0.994924 -0.055463 \n", "23128 0.122812 0.063564 0.710250 -0.277890 -1.099730 0.753118 \n", "23129 0.186859 -1.127251 1.057047 -0.211675 0.377588 -0.937336 \n", "\n", " blog_title_lemmatized \n", "0 10 000 sheep collaborative art project \n", "1 10 best practices to help partners build aws q... \n", "2 10 things serverless architects should know \n", "3 10 years of success aws and alert logic \n", "4 10 years of success aws and appian \n", "... ... \n", "23125 re invent bonus content m e focused sessions o... \n", "23126 re invent open source highlights week 1 \n", "23127 redbus building a data platform with aws apach... 
\n", "23128 s2n and lucky 13 \n", "23129 wefoxgroup s migration to aws and amazon eks \n", "\n", "[23130 rows x 1025 columns]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_df" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "embeddings_filename = \"blog_title_embeddings.csv\"\n", "results_df.to_csv(embeddings_filename, index=False)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading embeddings to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/embeddings\n", "Uploaded embeddings to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/embeddings\n" ] } ], "source": [ "embedding_upload_path = s3_path_join(\"s3://\",bucket_name,s3_prefix, 'embeddings')\n", "print(f\"Uploading embeddings to {embedding_upload_path}\")\n", "data_uri = S3Uploader.upload(embeddings_filename, embedding_upload_path)\n", "print(f\"Uploaded embeddings to {embedding_upload_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cluster titles" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "sklearn_processor_spectral_clustering = SKLearnProcessor(framework_version='1.0-1',\n", " role=sagemaker_execution_role,\n", " instance_type='ml.m5.2xlarge',\n", " instance_count=1)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Job Name: sagemaker-scikit-learn-2023-03-14-02-20-48-009\n", "Inputs: [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/embeddings', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-230294632802/sagemaker-scikit-learn-2023-03-14-02-20-48-009/input/code/SpectralClustering.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]\n", "Outputs: [{'OutputName': 'titles_clusters', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/results/clusters', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]\n", "..........................\u001b[34mArgs:Namespace(affinity='cosine', assign_labels='kmeans', n_clusters=20, n_init=100, n_neighbors=10, random_state=None)\u001b[0m\n", "\u001b[34mLoading embeddings\u001b[0m\n", "\u001b[34mPerforming clustering\u001b[0m\n", "\u001b[34mSaving clusters!\u001b[0m\n", "\n" ] } ], "source": [ "output_destination = os.path.join('s3://', bucket_name, s3_prefix, \"results\", \"clusters\")\n", "\n", "sklearn_processor_spectral_clustering.run(\n", " code=\"./scikit-sagemaker-clustering/SpectralClustering.py\",\n", " inputs=[ProcessingInput(source=embedding_upload_path, 
destination=\"/opt/ml/processing/input\")],\n", " outputs=[ProcessingOutput(output_name=\"titles_clusters\", source=\"/opt/ml/processing/output\", destination=output_destination)],\n", " arguments=[\"--n-clusters\", str(n_clusters),\n", " \"--n-init\", \"100\",\n", " \"--affinity\", \"cosine\",\n", " \"--n-neighbors\", \"10\",\n", " \"--assign-labels\", \"kmeans\"\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading cluster data to .\n", "Downloaded cluster data to .\n" ] } ], "source": [ "#Download the results. \n", "#The batch clustering job (step above) must have finished before you can run this cell.\n", "\n", "clusters_file = 'clustered_blog_titles_with_embeddings.csv'\n", "\n", "clusters_data_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"clusters\", clusters_file)\n", "print(f\"Downloading cluster data to .\")\n", "S3Downloader.download(clusters_data_path,'.')\n", "print(f\"Downloaded cluster data to .\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Automatic cluster labeling" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "clusters_df = pd.read_csv(clusters_file)\n", "clusters_titles = clusters_df[['blog_title_lemmatized', 'cluster_label']]" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster_label
010 000 sheep collaborative art project19
110 best practices to help partners build aws q...2
210 things serverless architects should know13
310 years of success aws and alert logic18
410 years of success aws and appian18
.........
23125re invent bonus content m e focused sessions o...15
23126re invent open source highlights week 115
23127redbus building a data platform with aws apach...17
23128s2n and lucky 1319
23129wefoxgroup s migration to aws and amazon eks18
\n", "

23130 rows × 2 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster_label\n", "0 10 000 sheep collaborative art project 19\n", "1 10 best practices to help partners build aws q... 2\n", "2 10 things serverless architects should know 13\n", "3 10 years of success aws and alert logic 18\n", "4 10 years of success aws and appian 18\n", "... ... ...\n", "23125 re invent bonus content m e focused sessions o... 15\n", "23126 re invent open source highlights week 1 15\n", "23127 redbus building a data platform with aws apach... 17\n", "23128 s2n and lucky 13 19\n", "23129 wefoxgroup s migration to aws and amazon eks 18\n", "\n", "[23130 rows x 2 columns]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_titles" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "clusters = [clusters_titles.loc[clusters_titles.cluster_label == i, 'blog_title_lemmatized'].to_list() for i in range(0, n_clusters)]" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "clusters_tf_idf = []\n", "clusters_tf_idf_terms = []\n", "clusters_tags = []\n", "clusters_keywords_tf_idf = []\n", "tf_idf_threshold = 0.2\n", "\n", "for cluster in clusters:\n", "\n", " tfIdfVectorizer = TfidfVectorizer(use_idf=True)\n", " tfIdf = tfIdfVectorizer.fit_transform(cluster)\n", " tf_idf_df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names(), columns=[\"TF-IDF\"])\n", " tf_idf_df = tf_idf_df.sort_values('TF-IDF', ascending=False)\n", " \n", " clusters_tf_idf.append(tf_idf_df)\n", " \n", " cluster_tf_idf_terms = list(tf_idf_df.loc[tf_idf_df['TF-IDF'] > tf_idf_threshold].index.values)\n", " clusters_tf_idf_terms.append(cluster_tf_idf_terms)\n", " \n", " tags = nltk.pos_tag(cluster_tf_idf_terms)\n", " clusters_tags.append(tags)\n", " \n", " keywords = [tag[0] for tag in tags if tag[1] in ['NN', 'NNS'] and tag[0] not in ['aws', 'amazon']]\n", " clusters_keywords_tf_idf.append(keywords)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['blockchain', 'datasets', 'data'],\n", " ['alces', 'flight', 'series', 'sector', 'university', 'power', 'research'],\n", " ['starts', 'practices', 'help', 'customers'],\n", " ['visualizations', 'data', 'quicksight'],\n", " ['success', 'years'],\n", " ['turbines', 'drone', 'wind', 'inspection', 'ai', 'driven'],\n", " ['simpler', 'sam', 'experience', 'deployment', 'cli'],\n", " ['feedback', 'whitepapers', 'videos', 'articles', 'year'],\n", " ['backup', 'ways', 'plans', 'rules'],\n", " ['patient', 'health', 'part', 'building', 'digital'],\n", " ['innovation', 'years'],\n", " ['switzerland', 'isae', 'finma', 'attestation', 'type', 'report'],\n", " ['experience', 'style', 'shop', 'pytorch'],\n", " ['architects', 'things'],\n", " ['review', 'year', 'mongodb', 'compatibility'],\n", " ['catalog', 'session', 'compliance', 'security'],\n", " ['decade', 'iops', 'ebs'],\n", " ['things', 'compatibility', 'mongodb'],\n", " ['years', 'success'],\n", " ['art', 'project']]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_keywords_tf_idf" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "clusters_df['categories'] = clusters_titles['cluster_label'].map(lambda i: clusters_keywords_tf_idf[i])" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster_labelcategories
4822 new or updated open datasets on aws new pol...0[blockchain, datasets, data]
533 gain insights from complex data featuring 3m0[blockchain, datasets, data]
634 steps to train and deploy machine learning m...0[blockchain, datasets, data]
10270 datasets inspire winning ideas to tackle op...0[blockchain, datasets, data]
105890 by capgemini with aws powering business de...0[blockchain, datasets, data]
............
22916what s around the turn in 2021 aws deepracer l...0[blockchain, datasets, data]
22972why our customers love amazon machine learning...0[blockchain, datasets, data]
22989why use docker containers for machine learning...0[blockchain, datasets, data]
22994will spark power the data behind precision med...0[blockchain, datasets, data]
23050yewno uses aws and ml to analyze vast amounts ...0[blockchain, datasets, data]
\n", "

1060 rows × 3 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster_label \\\n", "48 22 new or updated open datasets on aws new pol... 0 \n", "53 3 gain insights from complex data featuring 3m 0 \n", "63 4 steps to train and deploy machine learning m... 0 \n", "102 70 datasets inspire winning ideas to tackle op... 0 \n", "105 890 by capgemini with aws powering business de... 0 \n", "... ... ... \n", "22916 what s around the turn in 2021 aws deepracer l... 0 \n", "22972 why our customers love amazon machine learning... 0 \n", "22989 why use docker containers for machine learning... 0 \n", "22994 will spark power the data behind precision med... 0 \n", "23050 yewno uses aws and ml to analyze vast amounts ... 0 \n", "\n", " categories \n", "48 [blockchain, datasets, data] \n", "53 [blockchain, datasets, data] \n", "63 [blockchain, datasets, data] \n", "102 [blockchain, datasets, data] \n", "105 [blockchain, datasets, data] \n", "... ... \n", "22916 [blockchain, datasets, data] \n", "22972 [blockchain, datasets, data] \n", "22989 [blockchain, datasets, data] \n", "22994 [blockchain, datasets, data] \n", "23050 [blockchain, datasets, data] \n", "\n", "[1060 rows x 3 columns]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_df.loc[clusters_df['cluster_label']==0, ['blog_title_lemmatized', 'cluster_label', 'categories']]" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "clusters_categories_file = 'aws_blog_titles_clusters_categories.csv'\n", "clusters_df.to_csv(clusters_categories_file, index=False)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading clusters to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/results/clusters\n", "Uploaded clusters to s3://unsupervised-phrase-clustering-with-sagemaker/text-clustering-with-transformers/data/results/clusters\n" ] } ], "source": [ "clusters_data_path = s3_path_join(\"s3://\", bucket_name, s3_prefix, \"results\", \"clusters\")\n", "print(f\"Uploading clusters to {clusters_data_path}\")\n", "clusters_file_uri = S3Uploader.upload(clusters_categories_file, clusters_data_path)\n", "print(f\"Uploaded clusters to {clusters_data_path}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the clusters" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " after removing the cwd from sys.path.\n" ] } ], "source": [ "cluster_sample_df = clusters_df.sample(n=1000).reset_index()\n", "title_embeddings_sample = cluster_sample_df.iloc[:,:-3]\n", "clusters_titles_sample = cluster_sample_df[['blog_title_lemmatized', 'cluster_label', 'categories']]\n", "clusters_titles_sample['short_categories'] = clusters_titles_sample['categories'].map(lambda x: x[:2])" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blog_title_lemmatizedcluster_labelcategoriesshort_categories
0new aws lambda scaling controls for kinesis an...5[turbines, drone, wind, inspection, ai, driven][turbines, drone]
1high growth innovation powered by technology9[patient, health, part, building, digital][patient, health]
2aws java dao integration project18[years, success][years, success]
3aws online tech talks november 201718[years, success][years, success]
4icymi new stories updates and resources from a...2[starts, practices, help, customers][starts, practices]
...............
995announcing self service blacklisted address re...8[backup, ways, plans, rules][backup, ways]
996improving daemon services in amazon ecs16[decade, iops, ebs][decade, iops]
997improve your website availability with amazon ...4[success, years][success, years]
998how to package cookbook dependencies locally w...16[decade, iops, ebs][decade, iops]
999now hiring aws solutions architects18[years, success][years, success]
\n", "

1000 rows × 4 columns

\n", "
" ], "text/plain": [ " blog_title_lemmatized cluster_label \\\n", "0 new aws lambda scaling controls for kinesis an... 5 \n", "1 high growth innovation powered by technology 9 \n", "2 aws java dao integration project 18 \n", "3 aws online tech talks november 2017 18 \n", "4 icymi new stories updates and resources from a... 2 \n", ".. ... ... \n", "995 announcing self service blacklisted address re... 8 \n", "996 improving daemon services in amazon ecs 16 \n", "997 improve your website availability with amazon ... 4 \n", "998 how to package cookbook dependencies locally w... 16 \n", "999 now hiring aws solutions architects 18 \n", "\n", " categories short_categories \n", "0 [turbines, drone, wind, inspection, ai, driven] [turbines, drone] \n", "1 [patient, health, part, building, digital] [patient, health] \n", "2 [years, success] [years, success] \n", "3 [years, success] [years, success] \n", "4 [starts, practices, help, customers] [starts, practices] \n", ".. ... ... \n", "995 [backup, ways, plans, rules] [backup, ways] \n", "996 [decade, iops, ebs] [decade, iops] \n", "997 [success, years] [success, years] \n", "998 [decade, iops, ebs] [decade, iops] \n", "999 [years, success] [years, success] \n", "\n", "[1000 rows x 4 columns]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters_titles_sample" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "clusters_tsne = TSNE(perplexity=13, n_components=2, init='pca', n_iter=5000)\n", "tsne_embeddings = clusters_tsne.fit_transform(title_embeddings_sample)\n", "tsne_embeddings_df = pd.DataFrame(tsne_embeddings, columns=['x', 'y'])\n", "tsne_embeddings_df['cluster'] = clusters_titles_sample['cluster_label']\n", "tsne_embeddings_df['labels'] = clusters_titles_sample['categories']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "tsne_embeddings_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colors=[\n", " '#efaf50',\n", " '#a09934',\n", " '#e31ad9',\n", " '#cfcbb0',\n", " '#1224c9',\n", " '#669fa4',\n", " '#087274',\n", " '#787168',\n", " '#3e93cb',\n", " '#722823',\n", " '#c8784c',\n", " '#74ac48',\n", " '#c31033',\n", " '#5acc21',\n", " '#2ef8ba',\n", " '#c67ebe',\n", " '#805004',\n", " '#a8f43b',\n", " '#442d6d',\n", " '#9141ea',\n", "]\n", "\n", "fig, ax = plt.subplots(figsize=(30,30))\n", "ax = sns.scatterplot(data=tsne_embeddings_df, x='x', y='y', hue='cluster', legend='full', palette=colors, ax=ax)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tsne_embeddings_df[['cluster', 'labels']].drop_duplicates('cluster').sort_values('cluster')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.m5.large", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }