{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Accelerator - Tabular Data - Lecture 2\n", "\n", "\n", "## Text Preprocessing\n", "\n", "In this notebook we explore techniques to clean and convert text features into numerical features that machine learning algorithms can work with. \n", "\n", "1. Common text pre-processing\n", "2. Lexicon-based text processing\n", "3. Feature Extraction - Bag of Words\n", "4. Putting it all together\n", "\n" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[33mWARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.\n", "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p39/bin/python -m pip install --upgrade pip' command.\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -q -r ../requirements.txt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Common text pre-processing\n", "(Go to top)\n", "\n", "In this section, we will do some general purpose text cleaning." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "text = \" This is a message to be cleaned. It may involve some things like: <br>, ?, :, '' adjacent spaces and tabs . \"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's first lowercase our text. " ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " this is a message to be cleaned. it may involve some things like: <br>, ?, :, '' adjacent spaces and tabs . \n" ] } ], "source": [ "text = text.lower()\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can get rid of leading/trailing whitespace with the following:" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned. it may involve some things like: <br>, ?, :, '' adjacent spaces and tabs .\n" ] } ], "source": [ "text = text.strip()\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remove HTML tags/markups:" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned. it may involve some things like: , ?, :, '' adjacent spaces and tabs .\n" ] } ], "source": [ "import re\n", "\n", "text = re.compile('<.*?>').sub('', text)\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Replace punctuation with a space:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n" ] } ], "source": [ "import re, string\n", "\n", "text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remove extra spaces and tabs:" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n" ] } ], "source": [ "import re\n", "\n", "text = re.sub('\s+', ' ', text)\n", "print(text)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Lexicon-based text processing\n", "(Go to top)\n", "\n", "In section 1, we saw some general purpose text pre-processing methods. Lexicon-based methods are usually used __to normalize sentences in our dataset__ and later in section 3, we will use these normalized sentences for feature extraction. <br>\n",
"By normalization, here, __we mean putting words in the sentences into a similar format that will enhance similarities (if any) between sentences__. \n", "\n", "__Stop word removal:__ There can be some words in our sentences that occur very frequently and don't contribute too much to the overall meaning of the sentences. We usually have a list of these words and remove them from each of our sentences. For example: \"a\", \"an\", \"the\", \"this\", \"that\", \"is\"" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n", "\n", "filtered_sentence = []\n", "words = text.split(\" \")\n", "for w in words:\n", "    if w not in stop_words:\n", "        filtered_sentence.append(w)\n", "text = \" \".join(filtered_sentence)" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "message be cleaned may involve some things like adjacent spaces tabs \n" ] } ], "source": [ "print(text)" ] },
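{ "cell_type": "markdown", "metadata": {}, "source": [ "We wrote this stop word list by hand. As an optional alternative (a small sketch, assuming the NLTK stopwords corpus can be downloaded in your environment), NLTK also ships a ready-made English stop word list:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional alternative: NLTK's built-in English stop word list (instead of the hand-written list above)\n", "import nltk\n", "\n", "nltk.download('stopwords', quiet=True)  # one-time download of the stopwords corpus\n", "from nltk.corpus import stopwords\n", "\n", "print(stopwords.words('english')[:10])" ] },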
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Stemming:__ Stemming is a rule-based system to __convert words into their root form__. <br>\n", "It removes suffixes from words. This helps us enhance similarities (if any) between sentences. \n", "\n", "Example:\n", "\n", "\"jumping\", \"jumped\" -> \"jump\"\n", "\n", "\"cars\" -> \"car\"" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# We use the NLTK library\n", "import nltk\n", "from nltk.stem import SnowballStemmer\n", "\n", "# Initialize the stemmer\n", "snow = SnowballStemmer('english')\n", "\n", "stemmed_sentence = []\n", "words = text.split(\" \")\n", "for w in words:\n", "    stemmed_sentence.append(snow.stem(w))\n", "text = \" \".join(stemmed_sentence)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "messag be clean may involv some thing like adjac space tab \n" ] } ], "source": [ "print(text)" ] },
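{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, we can stem the example words mentioned above with the same stemmer:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick check: stem the example words from the markdown above\n", "print([snow.stem(w) for w in [\"jumping\", \"jumped\", \"cars\"]])" ] },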
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Feature Extraction - Bag of Words\n", "(Go to top)\n", "\n", "In this section, we assume we will first apply the common and lexicon-based pre-processing to our text. After those, we will convert our text data into numerical data with the __Bag of Words (BoW)__ representation. \n", "\n", "__Bag of Words (BoW)__: A modeling technique to convert text information into numerical representation. <br>\n", "__Machine learning models expect numerical or categorical values as input and won't work with raw text data__. \n", "\n", "Steps:\n", "1. Create vocabulary of known words\n", "2. Measure presence of the known words in sentences\n", "\n", "Let's see an interactive example for ourselves:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mluvisuals import *\n", "\n", "BagOfWords()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We will use the sklearn library's Bag of Words implementation:\n", "\n", "`from sklearn.feature_extraction.text import CountVectorizer`\n", "\n", "`countVectorizer = CountVectorizer(binary=True)`" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "countVectorizer = CountVectorizer(binary=True)\n", "\n", "sentences = [\n", "    'This is the first document.',\n", "    'This is the second second document.',\n", "    'And the third one.',\n", "    'Is this the first document?'\n", "]\n", "X = countVectorizer.fit_transform(sentences)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's print the vocabulary below. <br>\n", "Each number next to a word shows its index in the vocabulary (from 0 to 8 here). <br>\n", "The words are alphabetically ordered -> and:0, document:1, first:2, ..." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}\n" ] } ], "source": [ "print(countVectorizer.vocabulary_)" ] },
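{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (a small sketch, assuming scikit-learn 1.0 or newer, where `CountVectorizer` provides `get_feature_names_out()`), we can also list the vocabulary in column/index order:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: list the vocabulary in column/index order (requires scikit-learn >= 1.0)\n", "print(countVectorizer.get_feature_names_out())" ] },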
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Note:__ Sklearn automatically removes punctuation, but doesn't do the other extra pre-processing methods we discussed here. <br>\n", "Lexicon-based methods are also not automatically applied; we need to call those methods ourselves before feature extraction." ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 1 0 0 1 0 1]\n", " [0 1 0 1 0 1 1 0 1]\n", " [1 0 0 0 1 0 1 1 0]\n", " [0 1 1 1 0 0 1 0 1]]\n" ] } ], "source": [ "print(X.toarray())" ] },
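{ "cell_type": "markdown", "metadata": {}, "source": [ "With `binary=True`, each feature only records whether a word is present (0/1). By default, `CountVectorizer` counts occurrences instead; for example, \"second\" appears twice in the second sentence, so that entry would become 2. A small sketch for comparison:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For comparison (illustrative only): the default CountVectorizer keeps word counts instead of 0/1 indicators\n", "countVectorizerCounts = CountVectorizer(binary=False)\n", "X_counts = countVectorizerCounts.fit_transform(sentences)\n", "print(X_counts.toarray())" ] },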
{ "cell_type": "markdown", "metadata": {}, "source": [ "__What happens when we encounter a new word during prediction?__ \n", "\n", "__New words will be skipped__. <br>\n", "This usually happens when we are making predictions. For our test and validation data/text, we need to use the __.transform()__ function this time. <br>\n", "This simulates a real-time prediction case where we cannot re-train the model quickly whenever we receive new words." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 0 0 0 0 0 0 1]\n", " [0 0 0 1 1 0 0 0 1]]\n" ] } ], "source": [ "test_sentences = [\"this document has some new words\",\n", "                  \"this one is new too\"]\n", "\n", "count_vectors = countVectorizer.transform(test_sentences)\n", "print(count_vectors.toarray())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "See that these last two vectors have the same length 9 (same vocabulary) as the ones before." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Putting it all together\n", "(Go to top)\n", "\n", "Let's have a full example here. We will apply everything discussed in this notebook." ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Prepare cleaning functions\n", "import re, string\n", "import nltk\n", "from nltk.stem import SnowballStemmer\n", "\n", "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n", "\n", "stemmer = SnowballStemmer('english')\n", "\n", "def preProcessText(text):\n", "    # lowercase and strip leading/trailing white space\n", "    text = text.lower().strip()\n", "    \n", "    # remove HTML tags\n", "    text = re.compile('<.*?>').sub('', text)\n", "    \n", "    # remove punctuation\n", "    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n", "    \n", "    # remove extra white space\n", "    text = re.sub('\s+', ' ', text)\n", "    \n", "    return text\n", "\n", "def lexiconProcess(text, stop_words, stemmer):\n", "    filtered_sentence = []\n", "    words = text.split(\" \")\n", "    for w in words:\n", "        if w not in stop_words:\n", "            filtered_sentence.append(stemmer.stem(w))\n", "    text = \" \".join(filtered_sentence)\n", "    \n", "    return text\n", "\n", "def cleanSentence(text, stop_words, stemmer):\n", "    return lexiconProcess(preProcessText(text), stop_words, stemmer)" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Prepare the vectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "textvectorizer = CountVectorizer(binary=True)  # can also limit vocabulary size here, with say, max_features=50" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "Vocabulary: \n", " {'like': 11, 'materi': 13, 'color': 4, 'overal': 19, 'how': 10, 'look': 12, 'work': 29, 'okay': 18, 'first': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'my': 15, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 14, 'for': 8, 'hair': 9, 'dryer': 5}\n", "Bag of Words Binary Features: \n", " [[0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]\n", " [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]\n", " [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]\n", " [0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]\n", "(4, 31)\n" ] } ], "source": [ "# Clean and vectorize a text feature with four samples\n", "text_feature = [\"I liked the material, color and overall how it looks. <br /><br />\",\n", "                \"Worked okay first two times I used it, but third time burned my face.\",\n", "                \"I am not sure about this product.\",\n", "                \"I never thought I would pay so much for a hair dryer.\",\n", "               ]\n", "\n", "print(len(text_feature))\n", "\n", "# Clean up the text\n", "text_feature_cleaned = [cleanSentence(item, stop_words, stemmer) for item in text_feature]\n", "\n", "# Vectorize the cleaned text\n", "text_feature_vectorized = textvectorizer.fit_transform(text_feature_cleaned)\n", "print('Vocabulary: \\n', textvectorizer.vocabulary_)\n", "print('Bag of Words Binary Features: \\n', text_feature_vectorized.toarray())\n", "\n", "print(text_feature_vectorized.shape)" ] }
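, { "cell_type": "markdown", "metadata": {}, "source": [ "At prediction time we would clean any new text with the same functions and call __.transform()__ (not `fit_transform()`) on the already-fitted vectorizer, so the learned vocabulary stays fixed. Below is a minimal sketch with a made-up review:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: clean and vectorize a new (unseen) review with the already-fitted vectorizer\n", "# As in section 3, words outside the learned vocabulary are simply skipped\n", "new_reviews = [\"The hair dryer looks nice but burned my hair!\"]\n", "new_reviews_cleaned = [cleanSentence(item, stop_words, stemmer) for item in new_reviews]\n", "new_reviews_vectorized = textvectorizer.transform(new_reviews_cleaned)\n", "print(new_reviews_vectorized.toarray())" ] }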
], "metadata": { "kernelspec": { "display_name": "Python [conda env:conda_pytorch_p39] *", "language": "python", "name": "conda-env-conda_pytorch_p39-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 4 }