{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning for Telecom with Naive Bayes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates exploration of the CDR dataset and classification of CallDisconnectReason with the Spark ML Naive Bayes algorithm.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql.types import *\n", "from pyspark.sql import SparkSession\n", "from sagemaker import get_execution_role\n", "import sagemaker_pyspark\n", "\n", "\n", "role = get_execution_role()\n", "\n", "# Configure Spark to use the SageMaker Spark dependency jars\n", "jars = sagemaker_pyspark.classpath_jars()\n", "\n", "classpath = \":\".join(jars)\n", "\n", "spark = SparkSession.builder.config(\"spark.driver.extraClassPath\", classpath)\\\n", " .master(\"local[*]\").getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "S3 Select enables applications to retrieve only a subset of data from an object by using simple SQL expressions. 
By using S3 Select to retrieve only the data your application needs, you can achieve drastic performance increases, in many cases as much as a 400% improvement.\n", "\n", "- _We first use s3select to read the CDR dataset, stored in compressed Parquet format, which has already been processed by Glue._\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "cdr_start_loc = \"<%CDRStartFile%>\"\n", "cdr_stop_loc = \"<%CDRStopFile%>\"\n", "cdr_start_sample_loc = \"<%CDRStartSampleFile%>\"\n", "cdr_stop_sample_loc = \"<%CDRStopSampleFile%>\"\n", "\n", "df = spark.read.format(\"s3select\").parquet(cdr_stop_sample_loc)\n", "df.createOrReplaceTempView(\"cdr\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "22413" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "durationDF = spark.sql(\"SELECT _c13 as CallServiceDuration FROM cdr where _c0 = 'STOP'\")\n", "durationDF.count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploration of Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- _We explore and visualize the dataset used for processing. Here we create a bar chart of CallServiceDuration from the CDR dataset._" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "durationpd = durationDF.toPandas().astype(int) " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "durationpd.plot(kind='bar',stacked=True,width=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- _We can represent the data and visualize with a box plot. The box extends from the lower to upper quartile values of the data, with a line at the median._" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAD/NJREFUeJzt3Xus33V9x/HnSyreZwscO9ZS67QLY14ATxiK2QScEdwGZniLG9V0dBfcWHRzYJYwkl1wXpia6SziVowboGBoCFFIATeXgBS530YlEOiAFigoElDwvT/Op/O3ek7P7/ScXw98+nwkJ7/P9/P5fL/f9+8XeP2+/fxuqSokSf16znwXIEkaLYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1LkF810AwD777FPLly+f7zIk6VnlmmuuebCqxqab94wI+uXLl7Nhw4b5LkOSnlWS3D3MPJduJKlzBr0kdc6gl6TOGfSS1DmDXpI6N1TQJ7kryY1JrkuyofXtleTSJHe020WtP0k+k2RjkhuSHDzKOyBJ2rGZXNEfXlUHVtV42z4ZWF9VK4D1bRvgKGBF+1sNfH6uipUkzdxslm6OAda29lrg2IH+s2vClcDCJPvO4jySpFkY9gNTBVySpIAvVNUaYHFV3dfG7wcWt/YS4J6Bfe9tffcN9JFkNRNX/Cxbtmznqpdm6LTTTtsl5zn11FN3yXmkYQwb9G+qqk1JXgZcmuS2wcGqqvYkMLT2ZLEGYHx83F8o1y4x4wD+ZODD/uepZ7ehlm6qalO73Qx8HTgEeGDbkky73dymbwL2G9h9aeuTJM2DaYM+yYuSvGRbG3grcBOwDljZpq0ELmztdcDx7d03hwKPDizxSJJ2sWGWbhYDX0+ybf6/VdU3klwNnJdkFXA38K42/2LgaGAj8DjwgTmvWpI0tGmDvqruBF43Sf9DwJGT9Bdw4pxUJ0maNT8ZK0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjo3dNAn2SPJtUkuatuvSHJVko1Jzk2yZ+t/Xtve2MaXj6Z0SdIwZnJFfxJw68D2x4AzqupVwFZgVetfBWxt/We0eZKkeTJU0CdZCrwd+GLbDnAE8LU2ZS1wbGsf07Zp40e2+ZKkeTDsFf0/Ah8BftK29wYeqaqn2va9wJLWXgLcA9DGH23zJUnzYNqgT/KbwOaqumYuT5xkdZINSTZs2bJlLg8tSRowzBX9YcBvJ7kLOIeJJZtPAwuTLGhzlgKbWnsTsB9AG3
8p8ND2B62qNVU1XlXjY2Njs7oTkqSpTRv0VXVKVS2tquXAe4DLqup9wOXAcW3aSuDC1l7Xtmnjl1VVzWnVkqShzeZ99H8JfCjJRibW4M9q/WcBe7f+DwEnz65ESdJsLJh+yk9V1RXAFa19J3DIJHOeAN45B7VJkuaAn4yVpM4Z9JLUOYNekjpn0EtS52b0Yqz0TLLXSXux9fGtIz1H7Q85YfTf4LHohYt4+NMPj/w82j0Z9HrW2vr4VurM0X9EY1d8CGRXPJlo9+XSjSR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1btqgT/L8JN9Jcn2Sm5Oc1vpfkeSqJBuTnJtkz9b/vLa9sY0vH+1dkCTtyDBX9E8CR1TV64ADgbclORT4GHBGVb0K2AqsavNXAVtb/xltniRpnkwb9DXhsbb53PZXwBHA11r/WuDY1j6mbdPGj0ySOatYkjQjQ63RJ9kjyXXAZuBS4HvAI1X1VJtyL7CktZcA9wC08UeBvSc55uokG5Js2LJly+zuhSRpSkMFfVU9XVUHAkuBQ4D9Z3viqlpTVeNVNT42Njbbw0mSpjCjd91U1SPA5cAbgIVJFrShpcCm1t4E7AfQxl8KPDQn1UqSZmyYd92MJVnY2i8AfgO4lYnAP65NWwlc2Nrr2jZt/LKqqrksWpI0vAXTT2FfYG2SPZh4Yjivqi5KcgtwTpK/Aa4FzmrzzwK+nGQj8DDwnhHULUka0rRBX1U3AAdN0n8nE+v12/c/AbxzTqqTJM2an4yVpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknq3DBfUyw9M33x4+SLn5jvKubIx+HM+a5BvTLo9ez1+39BndnHb9rkhAB/Pt9lqFMu3UhS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktS5aYM+yX5JLk9yS5Kbk5zU+vdKcmmSO9rtotafJJ9JsjHJDUkOHvWdkCRNbZgr+qeAD1fVAcChwIlJDgBOBtZX1QpgfdsGOApY0f5WA5+f86olSUObNuir6r6q+m5r/wC4FVgCHAOsbdPWAse29jHA2TXhSmBhkn3nvHJJ0lBmtEafZDlwEHAVsLiq7mtD9wOLW3sJcM/Abve2PknSPBg66JO8GDgf+LOq+v7gWFUVMKMf70yyOsmGJBu2bNkyk10lSTMwVNAneS4TIf+VqrqgdT+wbUmm3W5u/ZuA/QZ2X9r6/p+qWlNV41U1PjY2trP1S5KmMcy7bgKcBdxaVZ8aGFoHrGztlcCFA/3Ht3ffHAo8OrDEI0naxRYMMecw4PeAG5Nc1/o+CpwOnJdkFXA38K42djFwNLAReBz4wJxWLEmakWmDvqq+DWSK4SMnmV/AibOsS5I0R/xkrCR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXML5rsAaTZyQkZ6/NofcttITwHAohcuGv1JtNsy6PWsVWfW6E/yyeya80gj5NKNJHXOoJekzhn0ktQ5g16SOmfQS1Lnpg36JF9KsjnJTQN9eyW5NMkd7XZR60+SzyTZmOSGJAePsnhJ0vSGuaL/V+Bt2/WdDKyvqhXA+rYNcBSwov2tBj4/N2VKknbWtEFfVf8BPLxd9zHA2tZeCxw70H92TbgSWJhk37kqVpI0czu7Rr+4qu5r7fuBxa29BLhnYN69rU+SNE9m/WJsVR
Uw448OJlmdZEOSDVu2bJltGZKkKexs0D+wbUmm3W5u/ZuA/QbmLW19P6Oq1lTVeFWNj42N7WQZkqTp7GzQrwNWtvZK4MKB/uPbu28OBR4dWOKRJM2Dab/ULMm/A28G9klyL3AqcDpwXpJVwN3Au9r0i4GjgY3A48AHRlCzJGkGpg36qnrvFENHTjK3gBNnW5Qkae74yVhJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjpn0EtS5wx6SeqcQS9JnTPoJalzBr0kdc6gl6TOGfSS1DmDXpI6Z9BLUucMeknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1LnDHpJ6pxBL0mdM+glqXMGvSR1zqCXpM4Z9JLUOYNekjo3kqBP8rYktyfZmOTkUZxDkjScOQ/6JHsA/wQcBRwAvDfJAXN9HknScEZxRX8IsLGq7qyqHwHnAMeM4DySpCEsGMExlwD3DGzfC/zq9pOSrAZWAyxbtmwEZUg/67TTTpvhHn8NM94HTj311BnvI43KKIJ+KFW1BlgDMD4+XvNVh3YvBrB2R6NYutkE7DewvbT1SZLmwSiC/mpgRZJXJNkTeA+wbgTnkSQNYc6XbqrqqSQfBL4J7AF8qapunuvzSJKGM5I1+qq6GLh4FMeWJM2Mn4yVpM4Z9JLUOYNekjpn0EtS51I1/59VSrIFuHu+65AmsQ/w4HwXIU3h5VU1Nt2kZ0TQS89USTZU1fh81yHNhks3ktQ5g16SOmfQSzu2Zr4LkGbLNXpJ6pxX9JLUOYNeI5Xk55Ock+R7Sa5JcnGSX9rB/Mfa7fIkN7X2C5N8JcmNSW5K8u0kL56j+i5OsnAn9nt/ki1Jrk1yR5JvJnnjXNTUjr8wyR8PbP9Ckq/N1fG1ezHoNTJJAnwduKKqXllVrwdOARbP8FAnAQ9U1Wuq6tXAKuDHM6hjj6nGquroqnpkhvVsc25VHVRVK4DTgQuS/PIM6trRlwouBP4v6Kvqf6rquJ2sU7s5g16jdDjw46r6520dVXU9cG2S9Um+267Sp/tN4X0Z+PGaqrq9qp4ESPK7Sb6T5LokX9gW6kkeS/LJJNcDpyT56rb9k7w5yUWtfVeSfVr7+CQ3JLk+yZdb31iS85Nc3f4Om6zAqrqciRduV7f9rkgy3tr7JLmrtd+fZF2Sy4D1SV48xWNxOvDKdr8+vt2/cJ6f5F/a/GuTHD5w7AuSfKP9K+MfpnlctZuYt58S1G7h1cA1k/Q/Abyjqr7fQvbKJOtq6ncGfAm4JMlxwHpgbVXd0a6e3w0cVlU/TvI54H3A2cCLgKuq6sPtyvnOJC+qqh+2fc4ZPEGSXwH+CnhjVT2YZK829GngjKr6dpJlTPzOwlRX7d8F/mCIx+Vg4LVV9XCr7WceC+Bk4NVVdWCrb/nA/icCVVWvSbJ/e2y2LYcdCBwEPAncnuSzVTX4G87aDRn0mg8B/i7JrwE/YeIH5RcD9082uaquS/KLwFuBtwBXJ3kDcCTw+rYN8AJgc9vtaeD8tv9TSb4B/FZb53478JHtTnME8NWqerDt83DrfwtwQDs+wM/t4PWBTNG/vUsHjj/VY7EjbwI+2+q8LcndwLagX19VjwIkuQV4OWDQ7+YMeo3SzcBk68rvA8aA17cr8buA5+/oQFX1GHABE+vgPwGOBn7ExNX9KZPs8kRVPT2wfQ7wQeBhYENV/WDI+/Ac4NCqemKwcyD4Bx0E3NraT/HTpdHt79sPB9ozfiym8eRA+2n8f1y4Rq/Rugx4XpLV2zqSvJaJq8zNLdgOb9tTSnJYkkWtvSdwABNfgrceOC7Jy9rYXkmmOta3mFgyOYHtlm0Gan1nkr23Hav1XwL8yUAtB05R468zsT5/Zuu6i4l/bcDkT3bbvJTJH4sfAC+ZYp//ZOIJgrZkswy4fQfn0G7OoNfItDX3dwBvycTbK28G/p6Jn5kcT3IjcDxw2zSHeiXwrTb/WmADcH5V3cLEuvolSW
4ALmXihdvJankauAg4qt1uP34z8LftPNcDn2pDf9pqvaEthfzhwG7vbi+W/jfwUeB3qmrbFf0ngD9Kci0T34A5la9M9lhU1UPAf2Xi7aQf326fzwHPafucC7x/24vT0mT8ZKwkdc4reknqnEEvSZ0z6CWpcwa9JHXOoJekzhn0ktQ5g16SOmfQS1Ln/hcrk/sucqClCgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "color = dict(boxes='DarkGreen', whiskers='DarkOrange',\n", " medians='DarkBlue', caps='Gray')\n", "\n", "durationpd.plot.box(color=color, sym='r+')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql.functions import col\n", "durationDF = durationDF.withColumn(\"CallServiceDuration\", col(\"CallServiceDuration\").cast(DoubleType())) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- _We can represent the data and visualize the data with histograms partitioned in different bins._" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([48., 0., 0., ..., 0., 0., 44.]),\n", " array([ 1. , 1.02226386, 1.04452773, ..., 499.95547227,\n", " 499.97773614, 500. ]),\n", " )" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADu9JREFUeJzt3FGMXFd9x/HvrzYhNFASJ4tlxVAHYQXloUnQKiQKqiAhKNCK5CGKiBBdVa78AlVQkcBppVaR+hBeCFSqUC1C8QOFhACyFSHANUFVpSqwJgaSmNQmSkQs27tAApQHqOHfhz1Ot+4uM7szs2sffz/S6N5z7pmd/9lMfnt95t5JVSFJOvf93noXIEkaDwNdkjphoEtSJwx0SeqEgS5JnTDQJakTBrokdWJgoCe5MsmhRY+fJ/lgkk1J9ic50raXrEXBkqSlZSU3FiXZABwD3gy8H/hpVd2XZBdwSVV9ZDJlSpIGWWmgvwP4u6q6McnTwFur6niSLcA3q+rK3/X8yy67rLZt2zZSwZJ0vjl48OCPq2pq0LiNK/y57wE+1/Y3V9Xxtn8C2LzUE5LsBHYCvO51r2N2dnaFLylJ57ckzw0zbugPRZNcALwb+MKZx2rhNH/JU/2q2l1V01U1PTU18A+MJGmVVnKVyzuB71TVydY+2ZZaaNu5cRcnSRreSgL9Lv53uQVgHzDT9meAveMqSpK0ckMFepKLgFuALy3qvg+4JckR4O2tLUlaJ0N9KFpVvwQuPaPvJ8DNkyhKkrRy3ikqSZ0w0CWpEwa6JHXCQJekThjoktQJA12SOmGgS1InDHRJ6oSBLkmdMNAlqRMGuiR1wkCXpE4Y6JLUCQNdkjphoEtSJwx0SeqEgS5JnTDQJakTBrokdcJAl6ROGOiS1AkDXZI6MVSgJ7k4ycNJfpDkcJIbkmxKsj/Jkba9ZNLFSpKWN+wZ+ieAr1bVG4GrgcPALuBAVW0HDrS2JGmdDAz0JK8G/hh4A
KCqfl1VLwK3AXvasD3A7ZMqUpI02DBn6FcA88A/J3k8yaeSXARsrqrjbcwJYPNST06yM8lsktn5+fnxVC1J+n+GCfSNwJuAT1bVtcAvOWN5paoKqKWeXFW7q2q6qqanpqZGrVeStIxhAv154Pmqeqy1H2Yh4E8m2QLQtnOTKVGSNIyBgV5VJ4AfJbmydd0MPAXsA2Za3wywdyIVSpKGsnHIcX8JfDbJBcAzwJ+z8MfgoSQ7gOeAOydToiRpGEMFelUdAqaXOHTzeMuRJK2Wd4pKUicMdEnqhIEuSZ0w0CWpEwa6JHXCQJekThjoktQJA12SOmGgS1InDHRJ6oSBLumck3uz3iWclQx0SeqEgS5JnTDQJakTBrokdcJAl6ROGOiS1AkDXZI6YaBLUicMdOkc5002Os1Al6ROGOiS1ImNwwxK8izwC+A3wKmqmk6yCXgQ2AY8C9xZVS9MpkxJ0iArOUN/W1VdU1XTrb0LOFBV24EDrS1JWiejLLncBuxp+3uA20cvR5K0WsMGegFfT3Iwyc7Wt7mqjrf9E8DmpZ6YZGeS2SSz8/PzI5YrSVrOUGvowFuq6liS1wD7k/xg8cGqqiS11BOrajewG2B6enrJMZKk0Q11hl5Vx9p2DvgycB1wMskWgLadm1SRkqTBBgZ6kouSvOr0PvAO4AlgHzDThs0AeydVpCRpsGGWXDYDX05yevy/VNVXk3wbeCjJDuA54M7JlSlJGmRgoFfVM8DVS/T/BLh5EkVJklbOO0UlqRMGuiR1wkCXpE4Y6JLUCQNdkjphoEtSJwx0SeqEgS5JnTDQJakTBrokdcJAl6ROGOiS1AkDXZI6YaBrzeTerHcJUtcMdEnqhIEuSZ0w0NeJyw+Sxs1Al6ROGOiS1AkDXZI6YaBLUicMdEnqxNCBnmRDkseTPNLaVyR5LMnRJA8muWByZUqSBlnJGfrdwOFF7Y8C91fVG4AXgB3jLEyStDJDBXqSrcCfAJ9q7QA3AQ+3IXuA2ydRoCRpOMOeoX8c+DDw29a+FHixqk619vPA5Us9McnOJLNJZufn50cqdq1404+kc9HAQE/yp8BcVR1czQtU1e6qmq6q6ampqdX8CEnSEDYOMeZG4N1J3gVcCPwB8Ang4iQb21n6VuDY5MqUJA0y8Ay9qu6pqq1VtQ14D/CNqnov8ChwRxs2A+ydWJWSpIFGuQ79I8BfJTnKwpr6A+MpSZK0GsMsubykqr4JfLPtPwNcN/6SJEmr4Z2ikrSEc/FqNwNdkjphoEtSJwx0SeqEgS5JnTDQJakTBrokdcJAl6ROGOiS1AkDXZI6YaCrK+fi3X3SuBjoktQJA12SOmGgS1InDHRJ6oSBLkmdMNAlqRMGuiR1wkCXpE4Y6JLUCQNdkjoxMNCTXJjkW0m+m+TJJPe2/iuSPJbkaJIHk1ww+XIlScsZ5gz9V8BNVXU1cA1wa5LrgY8C91fVG4AXgB2TK1OSNMjAQK8F/9WaL2uPAm4CHm79e4DbJ1KhJGkoQ62hJ9mQ5BAwB+wHfgi8WFWn2pDngcsnU6IkaRhDBXpV/aaqrgG2AtcBbxz2BZLsTDKbZHZ+fn6VZUqSBlnRVS5V9SLwKHADcHGSje3QVuDYMs/ZXVXTVTU9NTU1UrGSpOUNc5XLVJKL2/4rgFuAwywE+x1t2Aywd1JFSpIG2zh4CFuAPUk2sPAH4KGqeiTJU8Dnk/w98DjwwATrlCQNMDDQq+p7wLVL9D/Dwnq6JOks4J2iktQJA12SOmGgS1InDHRJ6oSBLkmdMNAlqRMGuiR1wkCXpE4Y6JLUCQNdkjphoEtSJwx0SeqEgS5JnTDQJakTBrokdcJAl6ROGOiS1AkDXZI6YaBLUicMdEnqhIEuSZ0w0CWpEwMDPclrkzya5KkkTya5u/VvSrI/yZG2vWTy5UqSljPMGfop4ENVdRVwPfD+JFcBu4ADVbUdONDakqR1MjDQq+p4VX2n7f8COAxcDtwG7GnD9gC3T
6pISdJgK1pDT7INuBZ4DNhcVcfboRPA5rFWJklakaEDPckrgS8CH6yqny8+VlUF1DLP25lkNsns/Pz8SMVKkpY3VKAneRkLYf7ZqvpS6z6ZZEs7vgWYW+q5VbW7qqaranpqamocNUuSljDMVS4BHgAOV9XHFh3aB8y0/Rlg7/jLkyQNa+MQY24E3gd8P8mh1vfXwH3AQ0l2AM8Bd06mREnSMAYGelX9O5BlDt883nIkSavlnaKS1AkDXZI6YaBLUicMdEnqhIEuSZ0w0CWpEwa6JHXCQJekThjoktQJA12SOmGgS1InDHRJ6oSBLkmdMNAlqRMGuiR1wkCXpE4Y6JLUCQNdkjphoEtSJwx0SeqEgS5JnTDQJakTAwM9yaeTzCV5YlHfpiT7kxxp20smW6YkaZBhztA/A9x6Rt8u4EBVbQcOtLYkaR0NDPSq+jfgp2d03wbsaft7gNvHXJckaYVWu4a+uaqOt/0TwOYx1SNJWqWRPxStqgJqueNJdiaZTTI7Pz8/6stJkpax2kA/mWQLQNvOLTewqnZX1XRVTU9NTa3y5SRJg6w20PcBM21/Btg7nnIkSas1zGWLnwP+A7gyyfNJdgD3AbckOQK8vbUlSeto46ABVXXXModuHnMtkqQReKeoJHXCQJekThjoktQJA12SOmGgS1InDHRJ6oSBLkmdMNAlqRMGuiR1wkCXpE4Y6JLUCQNdkjphoEtSJwx0SeqEgS5JnTDQJakTBrokdcJAl6ROGOiS1AkDXZI6YaBLUicMdEnqxEiBnuTWJE8nOZpk17iKkiSt3KoDPckG4B+BdwJXAXcluWpchUmSVmaUM/TrgKNV9UxV/Rr4PHDbeMqSJK3UKIF+OfCjRe3nW58kaR2kqlb3xOQO4Naq+ovWfh/w5qr6wBnjdgI7W/NK4OlVvNxlwI9XVei5yzmfH5zz+WHUOf9hVU0NGrRxhBc4Brx2UXtr6/s/qmo3sHuE1yHJbFVNj/IzzjXO+fzgnM8PazXnUZZcvg1sT3JFkguA9wD7xlOWJGmlVn2GXlWnknwA+BqwAfh0VT05tsokSSsyypILVfUV4CtjquV3GWnJ5hzlnM8Pzvn8sCZzXvWHopKks4u3/ktSJ876QO/16wWSfDrJXJInFvVtSrI/yZG2vaT1J8k/tN/B95K8af0qX70kr03yaJKnkjyZ5O7W3+28k1yY5FtJvtvmfG/rvyLJY21uD7YLC0jy8tY+2o5vW8/6VyvJhiSPJ3mktbueL0CSZ5N8P8mhJLOtb03f22d1oHf+9QKfAW49o28XcKCqtgMHWhsW5r+9PXYCn1yjGsftFPChqroKuB54f/vv2fO8fwXcVFVXA9cAtya5HvgocH9VvQF4AdjRxu8AXmj997dx56K7gcOL2r3P97S3VdU1iy5RXNv3dlWdtQ/gBuBri9r3APesd11jnN824IlF7aeBLW1/C/B02/8n4K6lxp3LD2AvcMv5Mm/g94HvAG9m4SaTja3/pfc5C1eN3dD2N7ZxWe/aVzjPrSyE103AI0B6nu+ieT8LXHZG35q+t8/qM3TOv68X2FxVx9v+CWBz2+/u99D+aX0t8Bidz7stPxwC5oD9wA+BF6vqVBuyeF4vzbkd/xlw6dpWPLKPAx8Gftval9L3fE8r4OtJDrY75GGN39sjXbaoyamqStLlJUhJXgl8EfhgVf08yUvHepx3Vf0GuCbJxcCXgTeuc0kTk+RPgbmqOpjkretdzxp7S1UdS/IaYH+SHyw+uBbv7bP9DH2orxfoyMkkWwDadq71d/N7SPIyFsL8s1X1pdbd/bwBqupF4FEWlhwuTnL6hGrxvF6aczv+auAna1zqKG4E3p3kWRa+gfUm4BP0O9+XVNWxtp1j4Q/3dazxe/tsD/Tz7esF9gEzbX+GhTXm0/1/1j4Zvx742aJ/xp0zsnAq/gBwuKo+tuhQt/NOMtXOzEnyChY+MzjMQrDf0YadOefTv4s7gG9UW2Q9F1TVP
VW1taq2sfD/6zeq6r10Ot/TklyU5FWn94F3AE+w1u/t9f4gYYgPGt4F/CcL645/s971jHFenwOOA//NwvrZDhbWDg8AR4B/BTa1sWHhap8fAt8Hpte7/lXO+S0srDN+DzjUHu/qed7AHwGPtzk/Afxt63898C3gKPAF4OWt/8LWPtqOv3695zDC3N8KPHI+zLfN77vt8eTprFrr97Z3ikpSJ872JRdJ0pAMdEnqhIEuSZ0w0CWpEwa6JHXCQJekThjoktQJA12SOvE//nYp6uWc4sYAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "bins, counts = durationDF.select('CallServiceDuration').rdd.flatMap(lambda x: x).histogram(durationDF.count())\n", "plt.hist(bins[:-1], bins=bins, weights=counts,color=['green'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+--------------+-------------+--------------------+\n", "| Accounting_ID|Calling_Number|Called_Number|CallDisconnectReason|\n", "+------------------+--------------+-------------+--------------------+\n", "|0x00016E0F5BDACAF7| 9645000046| 3512000046| 16|\n", "|0x00016E0F36A4A836| 9645000048| 3512000048| 16|\n", "|0x00016E0F4C261126| 9645000050| 3512000050| 16|\n", "|0x00016E0F4A446638| 9645000052| 3512000052| 16|\n", "|0x00016E0F4040CE81| 9645000054| 3512000054| 16|\n", "|0x00016E0F4D522D63| 9645000055| 3512000055| 16|\n", "|0x00016E0F5854A088| 9645000057| 3512000057| 16|\n", "|0x00016E0F7DFDA482| 9645000060| 3512000060| 16|\n", "|0x00016E0F65D65F76| 9645000062| 3512000062| 16|\n", "|0x00016E0F2378A4AE| 9645000064| 3512000064| 16|\n", "|0x00016E0F5003BC72| 9645000066| 3512000066| 16|\n", "| 0x00016E0F44702AB| 9645000067| 3512000067| 16|\n", "|0x00016E0F500EED75| 9645000069| 3512000069| 16|\n", "|0x00016E0F38D99C7D| 9645000071| 3512000071| 16|\n", "|0x00016E0F4D14C078| 9645000074| 3512000074| 16|\n", "|0x00016E0F4116E96C| 9645000075| 3512000075| 16|\n", "|0x00016E0F1F5CDE40| 9645000077| 3512000077| 16|\n", "|0x00016E0F1BFE3E2A| 9645000079| 3512000079| 16|\n", "|0x00016E0F7E203CC9| 9645000081| 3512000081| 16|\n", "| 0x00016E0F5B43F12| 9645000084| 3512000084| 16|\n", "+------------------+--------------+-------------+--------------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "sqlDF = spark.sql(\"SELECT _c2 as Accounting_ID, _c19 as Calling_Number,_c20 as Called_Number, _c14 as CallDisconnectReason FROM cdr 
where _c0 = 'STOP'\")\n", "sqlDF.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Featurization" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.feature import StringIndexer\n", "\n", "# Index each string column into a numeric column; rows with unseen values are skipped\n", "accountIndexer = StringIndexer(inputCol=\"Accounting_ID\", outputCol=\"AccountingIDIndex\")\n", "accountIndexer.setHandleInvalid(\"skip\")\n", "tempdf1 = accountIndexer.fit(sqlDF).transform(sqlDF)\n", "\n", "callingNumberIndexer = StringIndexer(inputCol=\"Calling_Number\", outputCol=\"Calling_NumberIndex\")\n", "callingNumberIndexer.setHandleInvalid(\"skip\")\n", "tempdf2 = callingNumberIndexer.fit(tempdf1).transform(tempdf1)\n", "\n", "calledNumberIndexer = StringIndexer(inputCol=\"Called_Number\", outputCol=\"Called_NumberIndex\")\n", "calledNumberIndexer.setHandleInvalid(\"skip\")\n", "tempdf3 = calledNumberIndexer.fit(tempdf2).transform(tempdf2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "StringIndexer_16798a0b6913" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyspark.ml.feature import StringIndexer\n", "# Convert target into numerical categories\n", "labelIndexer = StringIndexer(inputCol=\"CallDisconnectReason\", outputCol=\"label\")\n", "labelIndexer.setHandleInvalid(\"skip\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(16944, 5469)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainingFraction = 0.75\n", "testingFraction = 1 - trainingFraction\n", "seed = 1234\n", "\n", "trainData, testData = tempdf3.randomSplit([trainingFraction, testingFraction], seed=seed)\n", "\n", "# Cache the train and test data\n", "trainData.cache()\n", "testData.cache()\n", "trainData.count(),testData.count()" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "# Analyzing the label distribution\n", "\n", "- We analyze the distribution of our target labels using a histogram where 16 represents Normal_Call_Clearing." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "negcount = trainData.filter(\"CallDisconnectReason != 16\").count()\n", "poscount = trainData.filter(\"CallDisconnectReason == 16\").count()\n", "\n", "negfrac = 100*float(negcount)/float(negcount+poscount)\n", "posfrac = 100*float(poscount)/float(poscount+negcount)\n", "ind = [0.0,1.0]\n", "frac = [negfrac,posfrac]\n", "width = 0.35\n", "\n", "plt.title('Label Distribution')\n", "plt.bar(ind, frac, width, color='r')\n", "plt.xlabel(\"CallDisconnectReason\")\n", "plt.ylabel('Percentage share')\n", "plt.xticks(ind,['0.0','1.0'])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "negcount = testData.filter(\"CallDisconnectReason != 16\").count()\n", "poscount = testData.filter(\"CallDisconnectReason == 16\").count()\n", "\n", "negfrac = 100*float(negcount)/float(negcount+poscount)\n", "posfrac = 100*float(poscount)/float(poscount+negcount)\n", "ind = [0.0,1.0]\n", "frac = [negfrac,posfrac]\n", "width = 0.35\n", "\n", "plt.title('Label Distribution')\n", "plt.bar(ind, frac, width, color='r')\n", "plt.xlabel(\"CallDisconnectReason\")\n", "plt.ylabel('Percentage share')\n", "plt.xticks(ind,['0.0','1.0'])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.feature import VectorAssembler\n", "\n", "from pyspark.ml.feature import VectorAssembler\n", "\n", "vecAssembler = VectorAssembler(inputCols=[\"AccountingIDIndex\",\"Calling_NumberIndex\", \"Called_NumberIndex\"], outputCol=\"features\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Spark ML Naive Bayes__: \n", " Naive Bayes is a simple multiclass classification algorithm with the assumption of independence between every pair of features. Naive Bayes can be trained very efficiently. 
In a single pass over the training data, it computes the conditional probability distribution of each feature given the label; it then applies Bayes’ theorem to compute the conditional probability distribution of the label given an observation and uses that distribution for prediction.\n", "\n", "- _We use the Spark ML Naive Bayes algorithm and a Spark Pipeline to train on the data set._" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.classification import NaiveBayes\n", "from pyspark.ml import Pipeline\n", "\n", "# Configure a NaiveBayes estimator\n", "nb = NaiveBayes(smoothing=1.0, modelType=\"multinomial\")\n", "\n", "# Chain labelIndexer, vecAssembler and the NaiveBayes estimator in a Pipeline\n", "pipeline = Pipeline(stages=[labelIndexer,vecAssembler, nb])\n", "\n", "# Run stages in pipeline and train model\n", "model = pipeline.fit(trainData)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- Accounting_ID: string (nullable = true)\n", " |-- Calling_Number: string (nullable = true)\n", " |-- Called_Number: string (nullable = true)\n", " |-- CallDisconnectReason: string (nullable = true)\n", " |-- AccountingIDIndex: double (nullable = false)\n", " |-- Calling_NumberIndex: double (nullable = false)\n", " |-- Called_NumberIndex: double (nullable = false)\n", " |-- label: double (nullable = false)\n", " |-- features: vector (nullable = true)\n", " |-- rawPrediction: vector (nullable = true)\n", " |-- probability: vector (nullable = true)\n", " |-- prediction: double (nullable = false)\n", "\n", "+------------------+--------------+-------------+--------------------+-----------------+-------------------+------------------+-----+-------------------+--------------------+-----------+----------+\n", "| 
Accounting_ID|Calling_Number|Called_Number|CallDisconnectReason|AccountingIDIndex|Calling_NumberIndex|Called_NumberIndex|label| features| rawPrediction|probability|prediction|\n", "+------------------+--------------+-------------+--------------------+-----------------+-------------------+------------------+-----+-------------------+--------------------+-----------+----------+\n", "| 0x00016E0F1005CE4| 9645000075| 3512000075| 16| 2577.0| 38.0| 39.0| 0.0| [2577.0,38.0,39.0]|[-440.84691368645...| [1.0]| 0.0|\n", "|0x00016E0F100A017D| 9645000010| 3512000010| 16| 21710.0| 33.0| 35.0| 0.0|[21710.0,33.0,35.0]|[-560.2845426594596]| [1.0]| 0.0|\n", "|0x00016E0F100AF60A| 9645000077| 3512000077| 16| 6832.0| 34.0| 45.0| 0.0| [6832.0,34.0,45.0]|[-489.13697166313...| [1.0]| 0.0|\n", "|0x00016E0F104511E6| 9645000059| 3512000059| 16| 9768.0| 25.0| 21.0| 0.0| [9768.0,25.0,21.0]|[-335.75198901002...| [1.0]| 0.0|\n", "|0x00016E0F107E0142| 9645000038| 3512000038| 16| 13013.0| 12.0| 11.0| 0.0|[13013.0,12.0,11.0]| [-239.387649472332]| [1.0]| 0.0|\n", "|0x00016E0F107F6253| 9645000093| 3512000093| 16| 13936.0| 97.0| 98.0| 0.0|[13936.0,97.0,98.0]|[-1181.6164141280...| [1.0]| 0.0|\n", "|0x00016E0F109060DA| 9645000068| 3512000068| 16| 13255.0| 21.0| 13.0| 0.0|[13255.0,21.0,13.0]| [-301.258614162017]| [1.0]| 0.0|\n", "|0x00016E0F109962EA| 9645000014| 3512000014| 16| 12198.0| 57.0| 56.0| 0.0|[12198.0,57.0,56.0]|[-720.9963093850602]| [1.0]| 0.0|\n", "|0x00016E0F10AD5AA3| 9645000077| 3512000077| 16| 17980.0| 34.0| 45.0| 0.0|[17980.0,34.0,45.0]|[-587.2075683476078]| [1.0]| 0.0|\n", "|0x00016E0F10B7685D| 9645000080| 3512000080| 16| 19118.0| 51.0| 62.0| 0.0|[19118.0,51.0,62.0]|[-781.8683149993153]| [1.0]| 0.0|\n", "|0x00016E0F10BF7B70| 9645000092| 3512000092| 16| 13766.0| 18.0| 20.0| 0.0|[13766.0,18.0,20.0]|[-327.4738938976721]| [1.0]| 0.0|\n", "|0x00016E0F10C18111| 9645000037| 3512000037| 16| 4031.0| 83.0| 85.0| 0.0| [4031.0,83.0,85.0]|[-947.8468165201091]| [1.0]| 0.0|\n", "|0x00016E0F11135782| 
9645000025| 3512000025| 16| 4836.0| 35.0| 32.0| 0.0| [4836.0,35.0,32.0]|[-406.41238288465...| [1.0]| 0.0|\n", "|0x00016E0F113C9E41| 9645000074| 3512000074| 16| 15726.0| 85.0| 83.0| 0.0|[15726.0,85.0,83.0]|[-1050.7308703158...| [1.0]| 0.0|\n", "|0x00016E0F11525D52| 9645000023| 3512000023| 16| 15228.0| 62.0| 63.0| 0.0|[15228.0,62.0,63.0]|[-812.8214011632385]| [1.0]| 0.0|\n", "|0x00016E0F11533B86| 9645000023| 3512000023| 16| 899.0| 62.0| 63.0| 0.0| [899.0,62.0,63.0]|[-686.7670793214941]| [1.0]| 0.0|\n", "|0x00016E0F1156E519| 9645000047| 3512000047| 16| 14638.0| 42.0| 34.0| 0.0|[14638.0,42.0,34.0]|[-541.5216249696589]| [1.0]| 0.0|\n", "|0x00016E0F1159A234| 9645000005| 3512000005| 16| 11407.0| 16.0| 26.0| 0.0|[11407.0,16.0,26.0]|[-328.44207005065...| [1.0]| 0.0|\n", "|0x00016E0F11780902| 9645000099| 3512000099| 16| 19361.0| 27.0| 27.0| 0.0|[19361.0,27.0,27.0]|[-463.58856733144...| [1.0]| 0.0|\n", "|0x00016E0F117D2932| 9645000080| 3512000080| 16| 13710.0| 51.0| 62.0| 0.0|[13710.0,51.0,62.0]|[-734.2933430877965]| [1.0]| 0.0|\n", "+------------------+--------------+-------------+--------------------+-----------------+-------------------+------------------+-----+-------------------+--------------------+-----------+----------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ " # Run inference on the test data and show some results\n", "predictions = model.transform(testData)\n", "predictions.printSchema()\n", "predictions.show()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "predictiondf = predictions.select(\"label\", \"prediction\", \"probability\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "pddf_pred = predictions.toPandas()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Accounting_ID | Calling_Number | Called_Number | CallDisconnectReason | AccountingIDIndex | Calling_NumberIndex | Called_NumberIndex | label | features | rawPrediction | probability | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x00016E0F1005CE4 | 9645000075 | 3512000075 | 16 | 2577.0 | 38.0 | 39.0 | 0.0 | [2577.0, 38.0, 39.0] | [-440.84691368645406] | [1.0] | 0.0 |
| 1 | 0x00016E0F100A017D | 9645000010 | 3512000010 | 16 | 21710.0 | 33.0 | 35.0 | 0.0 | [21710.0, 33.0, 35.0] | [-560.2845426594596] | [1.0] | 0.0 |
| 2 | 0x00016E0F100AF60A | 9645000077 | 3512000077 | 16 | 6832.0 | 34.0 | 45.0 | 0.0 | [6832.0, 34.0, 45.0] | [-489.13697166313796] | [1.0] | 0.0 |
| 3 | 0x00016E0F104511E6 | 9645000059 | 3512000059 | 16 | 9768.0 | 25.0 | 21.0 | 0.0 | [9768.0, 25.0, 21.0] | [-335.75198901002926] | [1.0] | 0.0 |
| 4 | 0x00016E0F107E0142 | 9645000038 | 3512000038 | 16 | 13013.0 | 12.0 | 11.0 | 0.0 | [13013.0, 12.0, 11.0] | [-239.387649472332] | [1.0] | 0.0 |
| 5 | 0x00016E0F107F6253 | 9645000093 | 3512000093 | 16 | 13936.0 | 97.0 | 98.0 | 0.0 | [13936.0, 97.0, 98.0] | [-1181.6164141280706] | [1.0] | 0.0 |
| 6 | 0x00016E0F109060DA | 9645000068 | 3512000068 | 16 | 13255.0 | 21.0 | 13.0 | 0.0 | [13255.0, 21.0, 13.0] | [-301.258614162017] | [1.0] | 0.0 |
| 7 | 0x00016E0F109962EA | 9645000014 | 3512000014 | 16 | 12198.0 | 57.0 | 56.0 | 0.0 | [12198.0, 57.0, 56.0] | [-720.9963093850602] | [1.0] | 0.0 |
| 8 | 0x00016E0F10AD5AA3 | 9645000077 | 3512000077 | 16 | 17980.0 | 34.0 | 45.0 | 0.0 | [17980.0, 34.0, 45.0] | [-587.2075683476078] | [1.0] | 0.0 |
| 9 | 0x00016E0F10B7685D | 9645000080 | 3512000080 | 16 | 19118.0 | 51.0 | 62.0 | 0.0 | [19118.0, 51.0, 62.0] | [-781.8683149993153] | [1.0] | 0.0 |
| 10 | 0x00016E0F10BF7B70 | 9645000092 | 3512000092 | 16 | 13766.0 | 18.0 | 20.0 | 0.0 | [13766.0, 18.0, 20.0] | [-327.4738938976721] | [1.0] | 0.0 |
| 11 | 0x00016E0F10C18111 | 9645000037 | 3512000037 | 16 | 4031.0 | 83.0 | 85.0 | 0.0 | [4031.0, 83.0, 85.0] | [-947.8468165201091] | [1.0] | 0.0 |
| 12 | 0x00016E0F11135782 | 9645000025 | 3512000025 | 16 | 4836.0 | 35.0 | 32.0 | 0.0 | [4836.0, 35.0, 32.0] | [-406.41238288465064] | [1.0] | 0.0 |
| 13 | 0x00016E0F113C9E41 | 9645000074 | 3512000074 | 16 | 15726.0 | 85.0 | 83.0 | 0.0 | [15726.0, 85.0, 83.0] | [-1050.7308703158105] | [1.0] | 0.0 |
| 14 | 0x00016E0F11525D52 | 9645000023 | 3512000023 | 16 | 15228.0 | 62.0 | 63.0 | 0.0 | [15228.0, 62.0, 63.0] | [-812.8214011632385] | [1.0] | 0.0 |
| 15 | 0x00016E0F11533B86 | 9645000023 | 3512000023 | 16 | 899.0 | 62.0 | 63.0 | 0.0 | [899.0, 62.0, 63.0] | [-686.7670793214941] | [1.0] | 0.0 |
| 16 | 0x00016E0F1156E519 | 9645000047 | 3512000047 | 16 | 14638.0 | 42.0 | 34.0 | 0.0 | [14638.0, 42.0, 34.0] | [-541.5216249696589] | [1.0] | 0.0 |
| 17 | 0x00016E0F1159A234 | 9645000005 | 3512000005 | 16 | 11407.0 | 16.0 | 26.0 | 0.0 | [11407.0, 16.0, 26.0] | [-328.44207005065533] | [1.0] | 0.0 |
| 18 | 0x00016E0F11780902 | 9645000099 | 3512000099 | 16 | 19361.0 | 27.0 | 27.0 | 0.0 | [19361.0, 27.0, 27.0] | [-463.58856733144796] | [1.0] | 0.0 |
| 19 | 0x00016E0F117D2932 | 9645000080 | 3512000080 | 16 | 13710.0 | 51.0 | 62.0 | 0.0 | [13710.0, 51.0, 62.0] | [-734.2933430877965] | [1.0] | 0.0 |
| 20 | 0x00016E0F1195D573 | 9645000011 | 3512000011 | 16 | 18602.0 | 71.0 | 75.0 | 0.0 | [18602.0, 71.0, 75.0] | [-956.5501906528943] | [1.0] | 0.0 |
| 21 | 0x00016E0F11A22AE0 | 9645000027 | 3512000027 | 16 | 16454.0 | 48.0 | 42.0 | 0.0 | [16454.0, 48.0, 42.0] | [-633.528720854468] | [1.0] | 0.0 |
| 22 | 0x00016E0F11A40325 | 9645000085 | 3512000085 | 16 | 20684.0 | 91.0 | 91.0 | 0.0 | [20684.0, 91.0, 91.0] | [-1170.3786026181476] | [1.0] | 0.0 |
| 23 | 0x00016E0F11A6148B | 9645000089 | 3512000089 | 16 | 14327.0 | 79.0 | 66.0 | 0.0 | [14327.0, 79.0, 66.0] | [-913.5175409333822] | [1.0] | 0.0 |
| 24 | 0x00016E0F11C25723 | 9645000082 | 3512000082 | 16 | 2666.0 | 10.0 | 5.0 | 0.0 | [2666.0, 10.0, 5.0] | [-104.9180221824804] | [1.0] | 0.0 |
| 25 | 0x00016E0F11CB77B7 | 9645000077 | 3512000077 | 16 | 3722.0 | 34.0 | 45.0 | 0.0 | [3722.0, 34.0, 45.0] | [-461.7778439551454] | [1.0] | 0.0 |
| 26 | 0x00016E0F121E40C5 | 9645000076 | 3512000076 | 16 | 18888.0 | 20.0 | 23.0 | 0.0 | [18888.0, 20.0, 23.0] | [-399.6868792524099] | [1.0] | 0.0 |
| 27 | 0x00016E0F12302D02 | 9645000021 | 3512000021 | 16 | 14365.0 | 37.0 | 36.0 | 0.0 | [14365.0, 37.0, 36.0] | [-522.8249118161635] | [1.0] | 0.0 |
| 28 | 0x00016E0F1240F35C | 9645000072 | 3512000072 | 16 | 20227.0 | 94.0 | 94.0 | 0.0 | [20227.0, 94.0, 94.0] | [-1198.9435286840064] | [1.0] | 0.0 |
| 29 | 0x00016E0F124BDC0A | 9645000061 | 3512000061 | 16 | 19564.0 | 17.0 | 22.0 | 0.0 | [19564.0, 17.0, 22.0] | [-383.90956038827] | [1.0] | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5439 | 0x00016E0F74CB70E | 9645000025 | 3512000025 | 16 | 11120.0 | 35.0 | 32.0 | 0.0 | [11120.0, 35.0, 32.0] | [-461.69365571970695] | [1.0] | 0.0 |
| 5440 | 0x00016E0F7540A871 | 9645000012 | 3512000012 | 16 | 16886.0 | 44.0 | 44.0 | 0.0 | [16886.0, 44.0, 44.0] | [-626.46522124715] | [1.0] | 0.0 |
| 5441 | 0x00016E0F75C64122 | 9645000041 | 3512000041 | 16 | 3061.0 | 78.0 | 68.0 | 0.0 | [3061.0, 78.0, 68.0] | [-819.8386880641135] | [1.0] | 0.0 |
| 5442 | 0x00016E0F75D07FED | 9645000091 | 3512000091 | 16 | 8471.0 | 32.0 | 47.0 | 0.0 | [8471.0, 32.0, 47.0] | [-503.55407827205806] | [1.0] | 0.0 |
| 5443 | 0x00016E0F7743C746 | 9645000032 | 3512000032 | 16 | 13560.0 | 95.0 | 92.0 | 0.0 | [13560.0, 95.0, 92.0] | [-1134.8631413000821] | [1.0] | 0.0 |
| 5444 | 0x00016E0F77A9767B | 9645000047 | 3512000047 | 16 | 19922.0 | 42.0 | 34.0 | 0.0 | [19922.0, 42.0, 34.0] | [-588.0057506317273] | [1.0] | 0.0 |
| 5445 | 0x00016E0F7832FE4B | 9645000092 | 3512000092 | 16 | 18021.0 | 18.0 | 20.0 | 0.0 | [18021.0, 18.0, 20.0] | [-364.9057551187359] | [1.0] | 0.0 |
| 5446 | 0x00016E0F785E5B43 | 9645000086 | 3512000086 | 16 | 4949.0 | 8.0 | 9.0 | 0.0 | [4949.0, 8.0, 9.0] | [-135.86152354163926] | [1.0] | 0.0 |
| 5447 | 0x00016E0F78C17B26 | 9645000032 | 3512000032 | 16 | 9308.0 | 95.0 | 92.0 | 0.0 | [9308.0, 95.0, 92.0] | [-1097.4576715205371] | [1.0] | 0.0 |
| 5448 | 0x00016E0F7ABFB0CC | 9645000042 | 3512000042 | 16 | 3627.0 | 61.0 | 51.0 | 0.0 | [3627.0, 61.0, 51.0] | [-640.1682801951773] | [1.0] | 0.0 |
| 5449 | 0x00016E0F7AD35522 | 9645000093 | 3512000093 | 16 | 2664.0 | 97.0 | 98.0 | 0.0 | [2664.0, 97.0, 98.0] | [-1082.4549711941504] | [1.0] | 0.0 |
| 5450 | 0x00016E0F7B33B8F2 | 9645000078 | 3512000078 | 16 | 20701.0 | 50.0 | 55.0 | 0.0 | [20701.0, 50.0, 55.0] | [-752.3493622870137] | [1.0] | 0.0 |
| 5451 | 0x00016E0F7C39611E | 9645000082 | 3512000082 | 16 | 10148.0 | 10.0 | 5.0 | 0.0 | [10148.0, 10.0, 5.0] | [-170.73827733077636] | [1.0] | 0.0 |
| 5452 | 0x00016E0F7CEE7930 | 9645000025 | 3512000025 | 16 | 6866.0 | 35.0 | 32.0 | 0.0 | [6866.0, 35.0, 32.0] | [-424.2705916458162] | [1.0] | 0.0 |
| 5453 | 0x00016E0F7F06CE4D | 9645000040 | 3512000040 | 16 | 8126.0 | 1.0 | 2.0 | 0.0 | [8126.0, 1.0, 2.0] | [-87.77787468775551] | [1.0] | 0.0 |
| 5454 | 0x00016E0F7FB6CB21 | 9645000067 | 3512000067 | 16 | 15652.0 | 30.0 | 43.0 | 0.0 | [15652.0, 30.0, 43.0] | [-534.141878601174] | [1.0] | 0.0 |
| 5455 | 0x00016E0F7FDE97BA | 9645000037 | 3512000037 | 16 | 13825.0 | 83.0 | 85.0 | 0.0 | [13825.0, 83.0, 85.0] | [-1034.0060759323533] | [1.0] | 0.0 |
| 5456 | 0x00016E0F81D6029 | 9645000093 | 3512000093 | 16 | 9733.0 | 97.0 | 98.0 | 0.0 | [9733.0, 97.0, 98.0] | [-1144.642004560002] | [1.0] | 0.0 |
| 5457 | 0x00016E0F8525DAD | 9645000007 | 3512000007 | 16 | 11009.0 | 58.0 | 53.0 | 0.0 | [11009.0, 58.0, 53.0] | [-699.6761782293465] | [1.0] | 0.0 |
| 5458 | 0x00016E0F8C69FA2 | 9645000037 | 3512000037 | 16 | 19237.0 | 83.0 | 85.0 | 0.0 | [19237.0, 83.0, 85.0] | [-1081.6162364325642] | [1.0] | 0.0 |
| 5459 | 0x00016E0F96643EE | 9645000053 | 3512000053 | 16 | 19223.0 | 11.0 | 4.0 | 0.0 | [19223.0, 11.0, 4.0] | [-250.57309672944564] | [1.0] | 0.0 |
| 5460 | 0x00016E0FA5340F4 | 9645000052 | 3512000052 | 16 | 1419.0 | 66.0 | 69.0 | 0.0 | [1419.0, 66.0, 69.0] | [-745.6495909208345] | [1.0] | 0.0 |
| 5461 | 0x00016E0FA9AB2FC | 9645000046 | 3512000046 | 16 | 4772.0 | 40.0 | 28.0 | 0.0 | [4772.0, 40.0, 28.0] | [-411.28342547001455] | [1.0] | 0.0 |
| 5462 | 0x00016E0FB04F0A4 | 9645000001 | 3512000001 | 16 | 3785.0 | 77.0 | 80.0 | 0.0 | [3785.0, 77.0, 80.0] | [-885.9427896531429] | [1.0] | 0.0 |
| 5463 | 0x00016E0FD782A39 | 9645000081 | 3512000081 | 16 | 9729.0 | 88.0 | 90.0 | 0.0 | [9729.0, 88.0, 90.0] | [-1052.281664984985] | [1.0] | 0.0 |
| 5464 | 0x00016E0FE935247 | 9645000009 | 3512000009 | 16 | 593.0 | 74.0 | 72.0 | 0.0 | [593.0, 74.0, 72.0] | [-798.1244936259648] | [1.0] | 0.0 |
| 5465 | 0x00016E0FEAE927C | 9645000030 | 3512000030 | 16 | 8220.0 | 86.0 | 84.0 | 0.0 | [8220.0, 86.0, 84.0] | [-995.5612244100009] | [1.0] | 0.0 |
| 5466 | 0x00016E0FFB2D264 | 9645000078 | 3512000078 | 16 | 20823.0 | 50.0 | 55.0 | 0.0 | [20823.0, 50.0, 55.0] | [-753.4226142421182] | [1.0] | 0.0 |
| 5467 | 0x00016E0FFB333EE | 9645000044 | 3512000044 | 16 | 3623.0 | 56.0 | 59.0 | 0.0 | [3623.0, 56.0, 59.0] | [-656.4210955437193] | [1.0] | 0.0 |
| 5468 | 0x00016E0FFDB7D14 | 9645000036 | 3512000036 | 16 | 15533.0 | 70.0 | 70.0 | 0.0 | [15533.0, 70.0, 70.0] | [-896.9679412626872] | [1.0] | 0.0 |
\n", "

5469 rows × 12 columns

\n", "
" ], "text/plain": [ " Accounting_ID Calling_Number Called_Number CallDisconnectReason \\\n", "0 0x00016E0F1005CE4 9645000075 3512000075 16 \n", "1 0x00016E0F100A017D 9645000010 3512000010 16 \n", "2 0x00016E0F100AF60A 9645000077 3512000077 16 \n", "3 0x00016E0F104511E6 9645000059 3512000059 16 \n", "4 0x00016E0F107E0142 9645000038 3512000038 16 \n", "5 0x00016E0F107F6253 9645000093 3512000093 16 \n", "6 0x00016E0F109060DA 9645000068 3512000068 16 \n", "7 0x00016E0F109962EA 9645000014 3512000014 16 \n", "8 0x00016E0F10AD5AA3 9645000077 3512000077 16 \n", "9 0x00016E0F10B7685D 9645000080 3512000080 16 \n", "10 0x00016E0F10BF7B70 9645000092 3512000092 16 \n", "11 0x00016E0F10C18111 9645000037 3512000037 16 \n", "12 0x00016E0F11135782 9645000025 3512000025 16 \n", "13 0x00016E0F113C9E41 9645000074 3512000074 16 \n", "14 0x00016E0F11525D52 9645000023 3512000023 16 \n", "15 0x00016E0F11533B86 9645000023 3512000023 16 \n", "16 0x00016E0F1156E519 9645000047 3512000047 16 \n", "17 0x00016E0F1159A234 9645000005 3512000005 16 \n", "18 0x00016E0F11780902 9645000099 3512000099 16 \n", "19 0x00016E0F117D2932 9645000080 3512000080 16 \n", "20 0x00016E0F1195D573 9645000011 3512000011 16 \n", "21 0x00016E0F11A22AE0 9645000027 3512000027 16 \n", "22 0x00016E0F11A40325 9645000085 3512000085 16 \n", "23 0x00016E0F11A6148B 9645000089 3512000089 16 \n", "24 0x00016E0F11C25723 9645000082 3512000082 16 \n", "25 0x00016E0F11CB77B7 9645000077 3512000077 16 \n", "26 0x00016E0F121E40C5 9645000076 3512000076 16 \n", "27 0x00016E0F12302D02 9645000021 3512000021 16 \n", "28 0x00016E0F1240F35C 9645000072 3512000072 16 \n", "29 0x00016E0F124BDC0A 9645000061 3512000061 16 \n", "... ... ... ... ... 
\n", "5439 0x00016E0F74CB70E 9645000025 3512000025 16 \n", "5440 0x00016E0F7540A871 9645000012 3512000012 16 \n", "5441 0x00016E0F75C64122 9645000041 3512000041 16 \n", "5442 0x00016E0F75D07FED 9645000091 3512000091 16 \n", "5443 0x00016E0F7743C746 9645000032 3512000032 16 \n", "5444 0x00016E0F77A9767B 9645000047 3512000047 16 \n", "5445 0x00016E0F7832FE4B 9645000092 3512000092 16 \n", "5446 0x00016E0F785E5B43 9645000086 3512000086 16 \n", "5447 0x00016E0F78C17B26 9645000032 3512000032 16 \n", "5448 0x00016E0F7ABFB0CC 9645000042 3512000042 16 \n", "5449 0x00016E0F7AD35522 9645000093 3512000093 16 \n", "5450 0x00016E0F7B33B8F2 9645000078 3512000078 16 \n", "5451 0x00016E0F7C39611E 9645000082 3512000082 16 \n", "5452 0x00016E0F7CEE7930 9645000025 3512000025 16 \n", "5453 0x00016E0F7F06CE4D 9645000040 3512000040 16 \n", "5454 0x00016E0F7FB6CB21 9645000067 3512000067 16 \n", "5455 0x00016E0F7FDE97BA 9645000037 3512000037 16 \n", "5456 0x00016E0F81D6029 9645000093 3512000093 16 \n", "5457 0x00016E0F8525DAD 9645000007 3512000007 16 \n", "5458 0x00016E0F8C69FA2 9645000037 3512000037 16 \n", "5459 0x00016E0F96643EE 9645000053 3512000053 16 \n", "5460 0x00016E0FA5340F4 9645000052 3512000052 16 \n", "5461 0x00016E0FA9AB2FC 9645000046 3512000046 16 \n", "5462 0x00016E0FB04F0A4 9645000001 3512000001 16 \n", "5463 0x00016E0FD782A39 9645000081 3512000081 16 \n", "5464 0x00016E0FE935247 9645000009 3512000009 16 \n", "5465 0x00016E0FEAE927C 9645000030 3512000030 16 \n", "5466 0x00016E0FFB2D264 9645000078 3512000078 16 \n", "5467 0x00016E0FFB333EE 9645000044 3512000044 16 \n", "5468 0x00016E0FFDB7D14 9645000036 3512000036 16 \n", "\n", " AccountingIDIndex Calling_NumberIndex Called_NumberIndex label \\\n", "0 2577.0 38.0 39.0 0.0 \n", "1 21710.0 33.0 35.0 0.0 \n", "2 6832.0 34.0 45.0 0.0 \n", "3 9768.0 25.0 21.0 0.0 \n", "4 13013.0 12.0 11.0 0.0 \n", "5 13936.0 97.0 98.0 0.0 \n", "6 13255.0 21.0 13.0 0.0 \n", "7 12198.0 57.0 56.0 0.0 \n", "8 17980.0 34.0 45.0 0.0 \n", "9 19118.0 
51.0 62.0 0.0 \n", "10 13766.0 18.0 20.0 0.0 \n", "11 4031.0 83.0 85.0 0.0 \n", "12 4836.0 35.0 32.0 0.0 \n", "13 15726.0 85.0 83.0 0.0 \n", "14 15228.0 62.0 63.0 0.0 \n", "15 899.0 62.0 63.0 0.0 \n", "16 14638.0 42.0 34.0 0.0 \n", "17 11407.0 16.0 26.0 0.0 \n", "18 19361.0 27.0 27.0 0.0 \n", "19 13710.0 51.0 62.0 0.0 \n", "20 18602.0 71.0 75.0 0.0 \n", "21 16454.0 48.0 42.0 0.0 \n", "22 20684.0 91.0 91.0 0.0 \n", "23 14327.0 79.0 66.0 0.0 \n", "24 2666.0 10.0 5.0 0.0 \n", "25 3722.0 34.0 45.0 0.0 \n", "26 18888.0 20.0 23.0 0.0 \n", "27 14365.0 37.0 36.0 0.0 \n", "28 20227.0 94.0 94.0 0.0 \n", "29 19564.0 17.0 22.0 0.0 \n", "... ... ... ... ... \n", "5439 11120.0 35.0 32.0 0.0 \n", "5440 16886.0 44.0 44.0 0.0 \n", "5441 3061.0 78.0 68.0 0.0 \n", "5442 8471.0 32.0 47.0 0.0 \n", "5443 13560.0 95.0 92.0 0.0 \n", "5444 19922.0 42.0 34.0 0.0 \n", "5445 18021.0 18.0 20.0 0.0 \n", "5446 4949.0 8.0 9.0 0.0 \n", "5447 9308.0 95.0 92.0 0.0 \n", "5448 3627.0 61.0 51.0 0.0 \n", "5449 2664.0 97.0 98.0 0.0 \n", "5450 20701.0 50.0 55.0 0.0 \n", "5451 10148.0 10.0 5.0 0.0 \n", "5452 6866.0 35.0 32.0 0.0 \n", "5453 8126.0 1.0 2.0 0.0 \n", "5454 15652.0 30.0 43.0 0.0 \n", "5455 13825.0 83.0 85.0 0.0 \n", "5456 9733.0 97.0 98.0 0.0 \n", "5457 11009.0 58.0 53.0 0.0 \n", "5458 19237.0 83.0 85.0 0.0 \n", "5459 19223.0 11.0 4.0 0.0 \n", "5460 1419.0 66.0 69.0 0.0 \n", "5461 4772.0 40.0 28.0 0.0 \n", "5462 3785.0 77.0 80.0 0.0 \n", "5463 9729.0 88.0 90.0 0.0 \n", "5464 593.0 74.0 72.0 0.0 \n", "5465 8220.0 86.0 84.0 0.0 \n", "5466 20823.0 50.0 55.0 0.0 \n", "5467 3623.0 56.0 59.0 0.0 \n", "5468 15533.0 70.0 70.0 0.0 \n", "\n", " features rawPrediction probability prediction \n", "0 [2577.0, 38.0, 39.0] [-440.84691368645406] [1.0] 0.0 \n", "1 [21710.0, 33.0, 35.0] [-560.2845426594596] [1.0] 0.0 \n", "2 [6832.0, 34.0, 45.0] [-489.13697166313796] [1.0] 0.0 \n", "3 [9768.0, 25.0, 21.0] [-335.75198901002926] [1.0] 0.0 \n", "4 [13013.0, 12.0, 11.0] [-239.387649472332] [1.0] 0.0 \n", "5 
[13936.0, 97.0, 98.0] [-1181.6164141280706] [1.0] 0.0 \n", "6 [13255.0, 21.0, 13.0] [-301.258614162017] [1.0] 0.0 \n", "7 [12198.0, 57.0, 56.0] [-720.9963093850602] [1.0] 0.0 \n", "8 [17980.0, 34.0, 45.0] [-587.2075683476078] [1.0] 0.0 \n", "9 [19118.0, 51.0, 62.0] [-781.8683149993153] [1.0] 0.0 \n", "10 [13766.0, 18.0, 20.0] [-327.4738938976721] [1.0] 0.0 \n", "11 [4031.0, 83.0, 85.0] [-947.8468165201091] [1.0] 0.0 \n", "12 [4836.0, 35.0, 32.0] [-406.41238288465064] [1.0] 0.0 \n", "13 [15726.0, 85.0, 83.0] [-1050.7308703158105] [1.0] 0.0 \n", "14 [15228.0, 62.0, 63.0] [-812.8214011632385] [1.0] 0.0 \n", "15 [899.0, 62.0, 63.0] [-686.7670793214941] [1.0] 0.0 \n", "16 [14638.0, 42.0, 34.0] [-541.5216249696589] [1.0] 0.0 \n", "17 [11407.0, 16.0, 26.0] [-328.44207005065533] [1.0] 0.0 \n", "18 [19361.0, 27.0, 27.0] [-463.58856733144796] [1.0] 0.0 \n", "19 [13710.0, 51.0, 62.0] [-734.2933430877965] [1.0] 0.0 \n", "20 [18602.0, 71.0, 75.0] [-956.5501906528943] [1.0] 0.0 \n", "21 [16454.0, 48.0, 42.0] [-633.528720854468] [1.0] 0.0 \n", "22 [20684.0, 91.0, 91.0] [-1170.3786026181476] [1.0] 0.0 \n", "23 [14327.0, 79.0, 66.0] [-913.5175409333822] [1.0] 0.0 \n", "24 [2666.0, 10.0, 5.0] [-104.9180221824804] [1.0] 0.0 \n", "25 [3722.0, 34.0, 45.0] [-461.7778439551454] [1.0] 0.0 \n", "26 [18888.0, 20.0, 23.0] [-399.6868792524099] [1.0] 0.0 \n", "27 [14365.0, 37.0, 36.0] [-522.8249118161635] [1.0] 0.0 \n", "28 [20227.0, 94.0, 94.0] [-1198.9435286840064] [1.0] 0.0 \n", "29 [19564.0, 17.0, 22.0] [-383.90956038827] [1.0] 0.0 \n", "... ... ... ... ... 
\n", "5439 [11120.0, 35.0, 32.0] [-461.69365571970695] [1.0] 0.0 \n", "5440 [16886.0, 44.0, 44.0] [-626.46522124715] [1.0] 0.0 \n", "5441 [3061.0, 78.0, 68.0] [-819.8386880641135] [1.0] 0.0 \n", "5442 [8471.0, 32.0, 47.0] [-503.55407827205806] [1.0] 0.0 \n", "5443 [13560.0, 95.0, 92.0] [-1134.8631413000821] [1.0] 0.0 \n", "5444 [19922.0, 42.0, 34.0] [-588.0057506317273] [1.0] 0.0 \n", "5445 [18021.0, 18.0, 20.0] [-364.9057551187359] [1.0] 0.0 \n", "5446 [4949.0, 8.0, 9.0] [-135.86152354163926] [1.0] 0.0 \n", "5447 [9308.0, 95.0, 92.0] [-1097.4576715205371] [1.0] 0.0 \n", "5448 [3627.0, 61.0, 51.0] [-640.1682801951773] [1.0] 0.0 \n", "5449 [2664.0, 97.0, 98.0] [-1082.4549711941504] [1.0] 0.0 \n", "5450 [20701.0, 50.0, 55.0] [-752.3493622870137] [1.0] 0.0 \n", "5451 [10148.0, 10.0, 5.0] [-170.73827733077636] [1.0] 0.0 \n", "5452 [6866.0, 35.0, 32.0] [-424.2705916458162] [1.0] 0.0 \n", "5453 [8126.0, 1.0, 2.0] [-87.77787468775551] [1.0] 0.0 \n", "5454 [15652.0, 30.0, 43.0] [-534.141878601174] [1.0] 0.0 \n", "5455 [13825.0, 83.0, 85.0] [-1034.0060759323533] [1.0] 0.0 \n", "5456 [9733.0, 97.0, 98.0] [-1144.642004560002] [1.0] 0.0 \n", "5457 [11009.0, 58.0, 53.0] [-699.6761782293465] [1.0] 0.0 \n", "5458 [19237.0, 83.0, 85.0] [-1081.6162364325642] [1.0] 0.0 \n", "5459 [19223.0, 11.0, 4.0] [-250.57309672944564] [1.0] 0.0 \n", "5460 [1419.0, 66.0, 69.0] [-745.6495909208345] [1.0] 0.0 \n", "5461 [4772.0, 40.0, 28.0] [-411.28342547001455] [1.0] 0.0 \n", "5462 [3785.0, 77.0, 80.0] [-885.9427896531429] [1.0] 0.0 \n", "5463 [9729.0, 88.0, 90.0] [-1052.281664984985] [1.0] 0.0 \n", "5464 [593.0, 74.0, 72.0] [-798.1244936259648] [1.0] 0.0 \n", "5465 [8220.0, 86.0, 84.0] [-995.5612244100009] [1.0] 0.0 \n", "5466 [20823.0, 50.0, 55.0] [-753.4226142421182] [1.0] 0.0 \n", "5467 [3623.0, 56.0, 59.0] [-656.4210955437193] [1.0] 0.0 \n", "5468 [15533.0, 70.0, 70.0] [-896.9679412626872] [1.0] 0.0 \n", "\n", "[5469 rows x 12 columns]" ] }, "execution_count": 20, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "pddf_pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- _We use a scatter plot to visualize the predicted classes across the dataset._" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "# Set the size of the plot\n", "plt.figure(figsize=(14,7))\n", " \n", "# Create a colormap\n", "colormap = np.array(['red', 'lime', 'black'])\n", " \n", "# Plot CDR\n", "plt.subplot(1, 2, 1)\n", "plt.scatter(pddf_pred.Calling_NumberIndex, pddf_pred.Called_NumberIndex, c=pddf_pred.prediction)\n", "plt.title('CallDetailRecord')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluation" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0\n" ] } ], "source": [ "from pyspark.ml.evaluation import MulticlassClassificationEvaluator\n", "\n", "evaluator = MulticlassClassificationEvaluator(labelCol=\"label\", predictionCol=\"prediction\",\n", "                                              metricName=\"accuracy\")\n", "accuracy = evaluator.evaluate(predictiondf)\n", "print(accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Confusion Matrix" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sn\n", "\n", "# Select the true label first so that column 0 is y_true and column 1 is y_pred,\n", "# matching the argument order of confusion_matrix(y_true, y_pred)\n", "outdataframe = predictiondf.select(\"label\", \"prediction\")\n", "pandadf = outdataframe.toPandas()\n", "npmat = pandadf.values\n", "labels = npmat[:,0]\n", "predicted_label = npmat[:,1]\n", "\n", "cnf_matrix = confusion_matrix(labels, predicted_label)\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def plot_confusion_matrix(cm,\n", "                          target_names,\n", "                          title='Confusion matrix',\n", "                          cmap=None,\n", "                          normalize=True):\n", "    \"\"\"Plot a confusion matrix as a heatmap with per-cell counts or row-normalized rates.\"\"\"\n", "\n", "    import matplotlib.pyplot as plt\n", "    import numpy as np\n", "    import itertools\n", "\n", "    accuracy = np.trace(cm) / float(np.sum(cm))\n", "    misclass = 1 - accuracy\n", 
"\n", "    if cmap is None:\n", "        cmap = plt.get_cmap('Blues')\n", "\n", "    plt.figure(figsize=(8, 6))\n", "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", "    plt.title(title)\n", "    plt.colorbar()\n", "\n", "    if target_names is not None:\n", "        tick_marks = np.arange(len(target_names))\n", "        plt.xticks(tick_marks, target_names, rotation=45)\n", "        plt.yticks(tick_marks, target_names)\n", "\n", "    if normalize:\n", "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", "\n", "    thresh = cm.max() / 1.5 if normalize else cm.max() / 2\n", "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", "        if normalize:\n", "            plt.text(j, i, \"{:0.4f}\".format(cm[i, j]),\n", "                     horizontalalignment=\"center\",\n", "                     color=\"white\" if cm[i, j] > thresh else \"black\")\n", "        else:\n", "            plt.text(j, i, \"{:,}\".format(cm[i, j]),\n", "                     horizontalalignment=\"center\",\n", "                     color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", "\n", "    plt.tight_layout()\n", "    plt.ylabel('label')\n", "    plt.xlabel('Predicted \\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_confusion_matrix(cnf_matrix,\n", " normalize = False,\n", " target_names = ['Positive', 'Negative'],\n", " title = \"Confusion Matrix\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DenseMatrix([[5469.]])\n", "\n" ] } ], "source": [ "from pyspark.mllib.evaluation import MulticlassMetrics\n", "# Create (prediction, label) pairs\n", "predictionAndLabel = predictiondf.select(\"prediction\", \"label\").rdd\n", "\n", "# Generate confusion matrix\n", "metrics = MulticlassMetrics(predictionAndLabel)\n", "print(metrics.confusionMatrix())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross Validation" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.tuning import ParamGridBuilder, CrossValidator\n", "\n", "# Create ParamGrid and Evaluator for Cross Validation\n", "paramGrid = ParamGridBuilder().addGrid(nb.smoothing, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).build()\n", "cvEvaluator = MulticlassClassificationEvaluator(metricName=\"accuracy\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Run Cross-validation\n", "cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator)\n", "cvModel = cv.fit(trainData)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Make predictions on testData. 
cvModel uses the bestModel.\n", "cvPredictions = cvModel.transform(testData)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+----------+-----------+\n", "|label|prediction|probability|\n", "+-----+----------+-----------+\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "| 0.0| 0.0| [1.0]|\n", "+-----+----------+-----------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "cvPredictions.select(\"label\", \"prediction\", \"probability\").show()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Evaluate bestModel found from Cross Validation\n", "evaluator.evaluate(cvPredictions)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }