{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create a short-text clustering system using Amazon SageMaker JumpStart pre-trained transformer models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Introduction](#Introduction)\n", "2. [Setup](#Setup)\n", "3. [Create a model for text embeddings from the Jumpstart solutions library of models](#Create-a-model-for-text-embeddings-from-the-Jumpstart-solutions-library-of-models)\n", "4. [Data pre-processing](#Data-pre-processing)\n", "5. [Create phrase (sentence) embeddings](#Create-phrase-(sentence)-embeddings)\n", "6. [Cluster phrases (sentences)](#Cluster-phrases-(sentences))\n", "7. [Automatic cluster labeling](#Automatic-cluster-labeling)\n", "8. [Batch process the entire dataset](#Batch-process-the-entire-dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we demonstrate how to cluster short texts (phrases) using the pre-trained transformer models available in [Amazon SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). We use a transformer model called [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) to create an embedding for each phrase, and then cluster the phrases by their embeddings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by installing the packages this notebook needs on top of the SageMaker Python SDK: boto3, jsonlines, and seaborn." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "!pip install boto3 jsonlines seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Note: Restart the notebook's kernel after installing the above packages.**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "import json\n", "import re\n", "import os\n", "\n", "from sagemaker import get_execution_role\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import math\n", "\n", "import nltk\n", "from nltk.corpus import stopwords\n", "\n", "import seaborn as sns\n", "import 
matplotlib.pyplot as plt\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "from sklearn.manifold import TSNE\n", "from sklearn.cluster import SpectralClustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the NLTK library to help with pre-processing the data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "session = boto3.Session()\n", "sagemaker_execution_role = get_execution_role()\n", "s3 = session.resource('s3')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a model for text embeddings from the Jumpstart solutions library of models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use one of the text embedding [models available in SageMaker JumpStart](https://sagemaker.readthedocs.io/en/v2.129.0/doc_utils/pretrainedmodels.html)." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "#### Choose a model for inference" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "# We choose tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1 as the default model since it is well suited to phrase analysis\n", "\n", "model_id, model_version = (\n", " \"tensorflow-tcembedding-universal-sentence-encoder-cmlm-en-large-1\",\n", " \"*\")" ] }, { "cell_type": "markdown", "metadata": { 
"tags": [] }, "source": [ "You can continue with the default model or choose a different one from the dropdown generated by the next two cells. A complete list of SageMaker pre-trained models is also available at [SageMaker pre-trained models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [], "source": [ "from ipywidgets import Dropdown\n", "from sagemaker.jumpstart.notebook_utils import list_jumpstart_models, list_jumpstart_tasks\n", "\n", "# Retrieve all text embedding models.\n", "filter_value = \"task == tcembedding\"\n", "tcembedding_models = list_jumpstart_models(filter=filter_value)\n", "\n", "# Display the model IDs in a dropdown to select a model for inference.\n", "model_dropdown = Dropdown(\n", " options=tcembedding_models,\n", " value=model_id,\n", " description=\"Select a model\",\n", " style={\"description_width\": \"initial\"},\n", " layout={\"width\": \"max-content\"},\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ae82576e5f51430da1a128adba6a276d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Select a model', index=33, layout=Layout(width='max-content'), options=('mxnet-tcembeddi…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(model_dropdown)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# model_version=\"*\" fetches the latest version of the model\n", "model_id, model_version = model_dropdown.value, \"*\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"jumpstart-example-infer-{model_id}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 
Create the model from the selected model_id, model_version" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sagemaker import image_uris, model_uris, script_uris, hyperparameters\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "inference_instance_type = \"ml.m5.xlarge\" #You can change the instance according to your needs\n", "\n", "# Retrieve the inference docker container uri. This is the base Tensorflow container image for the default model above.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None, # automatically inferred from model_id\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=inference_instance_type,\n", ")\n", "\n", "# Retrieve the inference script uri. This includes all dependencies and scripts for model loading, inference handling etc.\n", "deploy_source_uri = script_uris.retrieve(\n", " model_id=model_id, model_version=model_version, script_scope=\"inference\"\n", ")\n", "\n", "\n", "# Retrieve the model uri. This includes the model and model parameters.\n", "model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"inference\"\n", ")\n", "\n", "\n", "# Create the SageMaker model instance\n", "embedding_model = Model(\n", " image_uri=deploy_image_uri,\n", " source_dir=deploy_source_uri,\n", " model_data=model_uri,\n", " entry_point=\"inference.py\", # entry point file in source_dir and present in deploy_source_uri\n", " role=sagemaker_execution_role,\n", " predictor_cls=Predictor,\n", " name=model_name,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data pre-processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this demonstration we will use a dataset made up of the blog titles for each blog published by AWS from 2004 until late 2022. 
We cluster the blog titles and assign a topic to each cluster.\n", "\n", "The text is pre-processed with the following steps:\n", "\n", "* Convert the text to lowercase\n", "* Replace acronyms with the words they stand for\n", "* Replace special word-bound characters such as / and - (e.g. images/videos, beer-wine) to get separate words\n", "* Remove explanations in parentheses\n", "* Remove any other non-word characters from the sentence\n", "* Split the sentence into tokens\n", "* Singularize each token" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "blogs_df = pd.read_csv('aws_blog_titles.csv', header=None, names=['URL', 'Title'])\n", "\n", "aws_acronyms_df = pd.read_csv('acronyms.csv', header=None, delimiter=';', names=['acronym', 'meaning'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "

| | URL | Title |
|---|---|---|
| 0 | https://aws.amazon.com/blogs/aws/10000_sheep_col/ | 10 000 Sheep Collaborative Art Project |
| 1 | https://aws.amazon.com/blogs/apn/10-best-pract... | 10 Best Practices to Help Partners Build AWS Q... |
| 2 | https://aws.amazon.com/blogs/architecture/ten-... | 10 Things Serverless Architects Should Know |
| 3 | https://aws.amazon.com/blogs/apn/10-years-of-s... | 10 Years of Success AWS and Alert Logic |
| 4 | https://aws.amazon.com/blogs/apn/10-years-of-s... | 10 Years of Success AWS and Appian |
| ... | ... | ... |
| 23195 | https://aws.amazon.com/blogs/media/reinvent-bo... | re Invent bonus content M E focused sessions o... |
| 23196 | https://aws.amazon.com/blogs/opensource/reinve... | re Invent open source highlights Week 1 |
| 23197 | https://aws.amazon.com/blogs/startups/redbus-b... | redBus Building a Data Platform with AWS Apach... |
| 23198 | https://aws.amazon.com/blogs/security/s2n-and-... | s2n and Lucky 13 |
| 23199 | https://aws.amazon.com/blogs/startups/wefoxgro... | wefoxgroup s Migration to AWS and Amazon EKS |

23200 rows × 2 columns

| | acronym | meaning |
|---|---|---|
| 0 | Amazon ES | Amazon Elasticsearch Service |
| 1 | AMI | Amazon Machine Image |
| 2 | API | Application Programming Interface |
| 3 | AI | Artificial Intelligence |
| 4 | ACL | Access Control List |
| ... | ... | ... |
| 120 | VPN | Virtual Private Network |
| 121 | VLAN | Virtual Local Area Network |
| 122 | VDI | Virtual Desktop Infrastructure |
| 123 | VPG | Virtual Private Gateway |
| 124 | WAF | Web Application Firewall |

125 rows × 2 columns
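
The pre-processing steps listed earlier can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual implementation: the `ACRONYMS` dictionary is a two-entry stand-in for the contents of `acronyms.csv`, and `singularize` is a hypothetical, deliberately naive helper (the original may well use a lemmatizer instead).

```python
import re

# Hypothetical two-entry stand-in for the acronym table loaded from acronyms.csv.
ACRONYMS = {"AMI": "Amazon Machine Image", "VPN": "Virtual Private Network"}


def singularize(token):
    """Naive singularizer (an assumption): strip one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def preprocess(title, acronyms=ACRONYMS):
    """Apply the listed steps in order and return a list of tokens."""
    # 1. Lowercase the text.
    text = title.lower()
    # 2. Replace acronyms (matched as whole words) with their meaning.
    for acronym, meaning in acronyms.items():
        text = re.sub(rf"\b{re.escape(acronym.lower())}\b", meaning.lower(), text)
    # 3. Split words joined by / or -.
    text = re.sub(r"[/-]", " ", text)
    # 4. Drop explanations in parentheses.
    text = re.sub(r"\([^)]*\)", " ", text)
    # 5. Remove any remaining non-word characters.
    text = re.sub(r"[^\w\s]", " ", text)
    # 6. Tokenize on whitespace, then 7. singularize each token.
    return [singularize(token) for token in text.split()]


tokens = preprocess("Copy an AMI across Regions (new feature) with VPN/Direct-Connect!")
print(" ".join(tokens))
```

Joining the tokens back with spaces gives a cleaned title of the kind shown in the `blog_title_lemmatized` column further down.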

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | blog_title_lemmatized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.259244 | 0.102127 | 0.089923 | -0.820289 | -0.223537 | -0.432897 | 0.799307 | -0.674757 | -0.252520 | -0.272772 | ... | 0.141702 | -0.392942 | 0.119546 | -0.386988 | -0.568477 | 0.205832 | -0.170402 | -1.025766 | 0.198024 | nova brings data science courses to the usmc w... |
| 1 | -0.692962 | 0.259012 | -0.748741 | -0.866656 | -0.901526 | -0.997768 | 0.377633 | -0.053566 | -0.503752 | 0.144736 | ... | 0.114172 | 0.522053 | 0.834067 | -0.771977 | -0.466612 | 0.195516 | -0.715755 | 0.326673 | -0.530868 | cloud native application monitoring for aws |
| 2 | -0.652742 | 0.126623 | 0.094145 | -0.545012 | -0.761059 | -0.479529 | -0.119898 | 0.001302 | -0.347351 | 0.485530 | ... | 0.158906 | -0.314114 | 0.502422 | -0.112350 | -0.143338 | -0.790585 | -0.104163 | -0.003831 | -0.414520 | sharing matlab applications on aws using the m... |
| 3 | 0.157903 | -1.218128 | -0.358249 | -0.143645 | -0.785935 | 0.111689 | 0.451628 | -0.040183 | 0.232529 | 0.813196 | ... | 0.574929 | -0.595977 | 0.733360 | 0.228505 | 0.484546 | 0.407535 | -0.656069 | 0.465197 | 0.631824 | amazon elastic file system shared file storage... |
| 4 | -0.199768 | 0.498280 | -0.096403 | -1.235060 | -0.164219 | -0.747414 | -0.280207 | 0.717804 | -1.201374 | 0.192102 | ... | -0.422836 | 0.048728 | 0.729825 | -0.974279 | 0.452009 | 0.646301 | -0.731122 | 0.526780 | -0.412448 | duplicating infrastructure on aws |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | -0.190524 | 0.190522 | 0.247199 | -0.909789 | -0.532487 | -0.958266 | -0.366771 | -0.469838 | -0.685504 | -0.749760 | ... | -0.482015 | 0.358346 | 0.935664 | -0.944232 | 0.374791 | 0.138823 | 0.717717 | -0.581709 | 0.456586 | aws week in review august 25 2014 |
| 996 | -0.524172 | -0.508515 | 0.474869 | -1.093429 | -1.255689 | -0.688424 | 0.898723 | 0.750922 | -1.154738 | 1.334882 | ... | 0.765626 | -0.240315 | 0.339106 | -0.142825 | 0.226154 | -0.350119 | -0.670260 | -0.665765 | -0.013468 | optimizing performance for users in china with... |
| 997 | -0.404731 | 0.581385 | -0.491551 | -0.588944 | -0.035981 | -0.529826 | 0.459939 | 0.522350 | -0.774718 | 0.081250 | ... | 0.718470 | -0.013323 | 1.003814 | -0.764670 | -0.364241 | 0.231567 | -0.153268 | -0.106582 | -1.071888 | aws accelerator for citrix migrate or deploy x... |
| 998 | 0.307507 | 0.646797 | 0.672350 | -0.879446 | -0.829220 | -0.173903 | 0.130434 | -0.228525 | 0.172912 | -0.866439 | ... | -0.857214 | -0.324904 | 1.445891 | -0.198214 | -0.104684 | 0.646890 | -0.060024 | -0.384143 | 0.098519 | in case you missed it september 2019 top blog ... |
| 999 | 0.926491 | -0.443366 | 0.707322 | -0.335114 | -1.174717 | -0.123556 | 0.919839 | 0.648331 | 0.875201 | -0.280192 | ... | 0.192803 | -0.163193 | 0.983904 | -0.519547 | 0.676546 | 0.539295 | -0.293465 | 0.378826 | 0.312693 | why avatars are usually awful and how snappr f... |

1000 rows × 1025 columns

| | blog_title_lemmatized | cluster |
|---|---|---|
| 0 | nova brings data science courses to the usmc w... | 9 |
| 1 | cloud native application monitoring for aws | 17 |
| 2 | sharing matlab applications on aws using the m... | 0 |
| 3 | amazon elastic file system shared file storage... | 5 |
| 4 | duplicating infrastructure on aws | 17 |
| ... | ... | ... |
| 995 | aws week in review august 25 2014 | 6 |
| 996 | optimizing performance for users in china with... | 8 |
| 997 | aws accelerator for citrix migrate or deploy x... | 17 |
| 998 | in case you missed it september 2019 top blog ... | 6 |
| 999 | why avatars are usually awful and how snappr f... | 9 |

1000 rows × 2 columns
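
A cluster column like the one above can be produced by running the `SpectralClustering` class imported in the setup cell over the embedding vectors. The sketch below only illustrates that call on synthetic data: the two well-separated groups of 8-dimensional points stand in for the 1024-dimensional sentence embeddings, and `n_clusters=2`, `affinity="rbf"`, and `gamma=1.0` are assumptions chosen for the toy data, not the notebook's settings (the cluster ids in the tables suggest it uses around 20 clusters).

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Synthetic stand-in for sentence embeddings: two tight groups in 8 dimensions.
group_a = rng.normal(loc=0.0, scale=0.1, size=(10, 8))
group_b = rng.normal(loc=1.0, scale=0.1, size=(10, 8))
embeddings = np.vstack([group_a, group_b])

# Assumed settings for this toy example; tune n_clusters and affinity for real data.
clusterer = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1.0, random_state=0)
labels = clusterer.fit_predict(embeddings)

# One integer cluster id per input row, analogous to the "cluster" column.
print(labels)
```

Fixing `random_state` keeps the label assignment reproducible between runs, which matters when cluster ids are later mapped to topic labels.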

| | idf_weights | word |
|---|---|---|
| aws | 1.117783 | aws |
| the | 1.492476 | the |
| for | 1.750306 | for |
| sdk | 1.944462 | sdk |
| with | 1.944462 | with |
| and | 2.098612 | and |
| to | 2.386294 | to |
| using | 2.504077 | using |
| android | 2.791759 | android |
| on | 2.791759 | on |
| in | 2.791759 | in |
| an | 2.791759 | an |
| console | 2.974081 | console |
| app | 2.974081 | app |
| application | 2.974081 | application |
| amazon | 2.974081 | amazon |
| new | 2.974081 | new |
| open | 3.197225 | open |
| release | 3.197225 | release |
| build | 3.197225 | build |
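
The IDF weights above appear consistent with scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, applied to a cluster of 35 titles (for example, ln(36/32) + 1 ≈ 1.117783 for "aws"). The following pure-Python sketch reproduces that formula on three made-up titles to show how the lowest-weight, most widespread words in a cluster surface as label candidates. The function name `smoothed_idf` and the sample titles are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter


def smoothed_idf(documents):
    """Smoothed IDF per word: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.split()))
    return {
        word: math.log((1 + n_docs) / (1 + df)) + 1
        for word, df in doc_freq.items()
    }


# Made-up titles standing in for the members of one cluster.
cluster_titles = [
    "aws sdk for android",
    "aws sdk for ios",
    "new aws console release",
]

idf = smoothed_idf(cluster_titles)

# Ascending order: words shared by most titles come first.
ranked = sorted(idf.items(), key=lambda item: item[1])
for word, weight in ranked:
    print(f"{word}\t{weight:.6f}")
```

As in the table, generic words such as "aws", "the", and "for" rank first, so a stop-word filter (for instance the NLTK list downloaded during setup) is typically needed before the top-ranked words make useful labels.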

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | blog_title_lemmatized |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.188292 | -0.390401 | -0.286212 | -0.732304 | -0.841209 | -1.088366 | -0.598213 | 1.023898 | 0.821023 | -1.058815 | ... | -0.175997 | 0.245866 | 0.706102 | 0.783085 | -0.823547 | 1.258304 | -0.043126 | -0.472476 | -0.351324 | 10 000 sheep collaborative art project |
| 1 | -0.264925 | -0.108307 | 0.688968 | -0.680849 | -0.351448 | -0.899282 | 0.391024 | -0.635701 | 0.246767 | -0.395383 | ... | 0.673714 | 0.167650 | 1.476131 | 0.409790 | -0.050008 | 0.582545 | 0.279141 | -0.205470 | -0.820358 | 10 best practices to help partners build aws q... |
| 2 | 0.371885 | 0.771379 | 1.069450 | -1.008337 | 0.655574 | -0.955685 | -0.711391 | 0.625457 | 0.369845 | -0.443280 | ... | -0.096941 | -0.025879 | 0.967100 | -0.283345 | 0.691451 | 0.882124 | 0.534032 | -0.739388 | 0.378053 | 10 things serverless architects should know |
| 3 | -0.301370 | 0.847713 | 0.486485 | -0.280148 | -1.502466 | -1.012401 | -0.091017 | 0.317997 | -0.872763 | -0.809212 | ... | -0.392543 | 0.050554 | 0.026191 | -0.436454 | -0.320372 | 0.507621 | -0.305170 | -0.284728 | -0.776441 | 10 years of success aws and alert logic |
| 4 | 0.029871 | 0.295120 | 0.633014 | -0.412160 | -1.094922 | -0.794995 | 0.185795 | -0.116586 | -0.548522 | -1.379145 | ... | 0.548648 | 0.010977 | 0.174071 | -0.338245 | -0.298892 | 0.599860 | 0.190257 | -0.457519 | -1.003187 | 10 years of success aws and appian |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23125 | -0.022962 | 1.256789 | 0.323558 | -0.199724 | 0.179075 | 0.219997 | 0.139460 | -0.184126 | -0.289477 | -0.732770 | ... | -0.750940 | 0.333212 | 0.500258 | -0.642244 | -0.251067 | 1.796083 | -0.214922 | -0.485310 | -0.257823 | re invent bonus content m e focused sessions o... |
| 23126 | 0.700433 | 0.953094 | 1.036029 | -0.781870 | -0.111338 | -0.478488 | 0.362508 | -0.422137 | 0.382570 | -0.086834 | ... | -0.269035 | 0.257387 | 0.684636 | -0.135962 | -0.267956 | 0.802172 | -0.122769 | -0.162395 | 0.466416 | re invent open source highlights week 1 |
| 23127 | 0.096553 | 0.953412 | 0.493536 | -1.050031 | 0.031639 | -0.762166 | -0.221962 | 0.420519 | -0.328862 | -0.170843 | ... | -0.729130 | 0.518689 | 0.360414 | 0.419243 | -0.244598 | 0.524628 | -0.373507 | -0.994924 | -0.055463 | redbus building a data platform with aws apach... |
| 23128 | -0.662509 | -0.092624 | 0.064623 | -0.446625 | -0.261067 | -0.437387 | 0.701777 | 0.340038 | -0.293806 | -0.226297 | ... | 0.881936 | 0.450375 | 0.289184 | 0.122812 | 0.063564 | 0.710250 | -0.277890 | -1.099730 | 0.753118 | s2n and lucky 13 |
| 23129 | -0.542878 | -0.237617 | 0.501378 | -1.180179 | 0.353943 | -0.340788 | -0.538909 | 0.480047 | -0.763953 | -0.598589 | ... | -0.372877 | -0.195486 | 0.864477 | 0.186859 | -1.127251 | 1.057047 | -0.211675 | 0.377588 | -0.937336 | wefoxgroup s migration to aws and amazon eks |

23130 rows × 1025 columns

| | blog_title_lemmatized | cluster_label |
|---|---|---|
| 0 | 10 000 sheep collaborative art project | 19 |
| 1 | 10 best practices to help partners build aws q... | 2 |
| 2 | 10 things serverless architects should know | 13 |
| 3 | 10 years of success aws and alert logic | 18 |
| 4 | 10 years of success aws and appian | 18 |
| ... | ... | ... |
| 23125 | re invent bonus content m e focused sessions o... | 15 |
| 23126 | re invent open source highlights week 1 | 15 |
| 23127 | redbus building a data platform with aws apach... | 17 |
| 23128 | s2n and lucky 13 | 19 |
| 23129 | wefoxgroup s migration to aws and amazon eks | 18 |

23130 rows × 2 columns

| | blog_title_lemmatized | cluster_label | categories |
|---|---|---|---|
| 48 | 22 new or updated open datasets on aws new pol... | 0 | [blockchain, datasets, data] |
| 53 | 3 gain insights from complex data featuring 3m | 0 | [blockchain, datasets, data] |
| 63 | 4 steps to train and deploy machine learning m... | 0 | [blockchain, datasets, data] |
| 102 | 70 datasets inspire winning ideas to tackle op... | 0 | [blockchain, datasets, data] |
| 105 | 890 by capgemini with aws powering business de... | 0 | [blockchain, datasets, data] |
| ... | ... | ... | ... |
| 22916 | what s around the turn in 2021 aws deepracer l... | 0 | [blockchain, datasets, data] |
| 22972 | why our customers love amazon machine learning... | 0 | [blockchain, datasets, data] |
| 22989 | why use docker containers for machine learning... | 0 | [blockchain, datasets, data] |
| 22994 | will spark power the data behind precision med... | 0 | [blockchain, datasets, data] |
| 23050 | yewno uses aws and ml to analyze vast amounts ... | 0 | [blockchain, datasets, data] |

1060 rows × 3 columns

| | blog_title_lemmatized | cluster_label | categories | short_categories |
|---|---|---|---|---|
| 0 | new aws lambda scaling controls for kinesis an... | 5 | [turbines, drone, wind, inspection, ai, driven] | [turbines, drone] |
| 1 | high growth innovation powered by technology | 9 | [patient, health, part, building, digital] | [patient, health] |
| 2 | aws java dao integration project | 18 | [years, success] | [years, success] |
| 3 | aws online tech talks november 2017 | 18 | [years, success] | [years, success] |
| 4 | icymi new stories updates and resources from a... | 2 | [starts, practices, help, customers] | [starts, practices] |
| ... | ... | ... | ... | ... |
| 995 | announcing self service blacklisted address re... | 8 | [backup, ways, plans, rules] | [backup, ways] |
| 996 | improving daemon services in amazon ecs | 16 | [decade, iops, ebs] | [decade, iops] |
| 997 | improve your website availability with amazon ... | 4 | [success, years] | [success, years] |
| 998 | how to package cookbook dependencies locally w... | 16 | [decade, iops, ebs] | [decade, iops] |
| 999 | now hiring aws solutions architects | 18 | [years, success] | [years, success] |

1000 rows × 4 columns
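
In every row shown above, `short_categories` equals the first two entries of `categories`. A minimal sketch of that truncation follows; the helper name `shorten_categories` and its `max_terms` parameter are hypothetical, and the notebook may derive the column differently.

```python
def shorten_categories(categories, max_terms=2):
    """Keep only the first `max_terms` label words of a cluster (assumed rule)."""
    return list(categories[:max_terms])


# Category lists copied from the table above.
examples = [
    ["turbines", "drone", "wind", "inspection", "ai", "driven"],
    ["years", "success"],
    ["decade", "iops", "ebs"],
]

for categories in examples:
    print(categories, "->", shorten_categories(categories))
```

With pandas, the same rule could be applied to a whole column using `Series.apply`.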