{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Using AI Services for Analyzing Public Data\n", "by Manav Sehgal | on APR 30 2019\n", "\n", "So far we have been working with structured data in flat files as our data source. What if the source is images or unstructured text? AWS AI services provide vision, transcription, translation, personalization, and forecasting capabilities without the need to train and deploy machine learning models. AWS manages the machine learning complexity, so you can focus on the problem at hand: send the required inputs for analysis and receive the output from these services within your applications.\n", "\n", "Extending our open data analytics use case to New York traffic, let us use the AWS AI services to turn open data available in social media, Wikipedia, and other sources into structured datasets and insights.\n", "\n", "We will start by importing dependencies for the AWS SDK, pandas data frames, file operations, handling JSON data, and display formatting. We will also initialize the Rekognition client for use in the rest of this notebook." ] }, { "cell_type": "code", "execution_count": 216, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import pandas as pd\n", "import io\n", "import json\n", "from IPython.display import display, Markdown, Image\n", "\n", "rekognition = boto3.client('rekognition', 'us-east-1')\n", "image_bucket = 'open-data-analytics-taxi-trips'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Show Image\n", "We will work with a number of images, so we need a way to show them within this notebook. Our function builds a public image URL from the S3 bucket and key it receives as input."
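, "\n", "If a key contains spaces or other special characters, plain string concatenation can produce an invalid URL. A minimal sketch of a safer URL builder, assuming the bucket is public (the ``image_url`` helper name is illustrative):\n", "\n", "```python\n", "from urllib.parse import quote\n", "\n", "def image_url(bucket, key):\n", "    # Percent-encode the key; '/' is kept so the key's folder structure survives\n", "    return 'https://s3.amazonaws.com/' + bucket + '/' + quote(key)\n", "```"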
] }, { "cell_type": "code", "execution_count": 217, "metadata": {}, "outputs": [], "source": [ "def show_image(bucket, key, img_width=500):\n", "    # [TODO] Load non-public images\n", "    return Image(url='https://s3.amazonaws.com/' + bucket + '/' + key, width=img_width)" ] }, { "cell_type": "code", "execution_count": 218, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 218, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_image(image_bucket, 'images/traffic-in-manhattan.jpg', 1024)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Image Labels\n", "One of the use cases for traffic analytics is processing traffic CCTV imagery or social media uploads. Let's consider a traffic location where, depending on the number of cars, trucks, and pedestrians, we can identify whether there is a traffic jam. This insight can be used to better manage the flow of traffic around the location and to plan ahead for future use of this route.\n", "\n", "The first step in this kind of analytics is to recognize that we are actually looking at an image which may represent a traffic jam. We create the ``image_labels`` function, which uses the ``detect_labels`` Rekognition API to detect objects within an image. The function prints the detected labels along with their confidence scores.\n", "\n", "In the given example, notice that somewhere in the middle of the label listing, at 73% confidence, the Rekognition computer vision model has actually detected a traffic jam."
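, "\n", "The results can also be narrowed. The ``detect_labels`` API accepts ``MinConfidence`` and ``MaxLabels`` parameters, or the response can be filtered client-side. A minimal client-side sketch over an illustrative response fragment (the sample values and the ``labels_above`` name are for illustration only):\n", "\n", "```python\n", "# Fragment shaped like the 'Labels' list in a detect_labels response\n", "sample_labels = [\n", "    {'Name': 'Vehicle', 'Confidence': 99.9},\n", "    {'Name': 'Traffic Jam', 'Confidence': 73.0},\n", "    {'Name': 'Pedestrian', 'Confidence': 59.0},\n", "]\n", "\n", "def labels_above(labels, threshold):\n", "    # Keep only label names at or above the confidence threshold\n", "    return [label['Name'] for label in labels if label['Confidence'] >= threshold]\n", "\n", "labels_above(sample_labels, 70)  # ['Vehicle', 'Traffic Jam']\n", "```"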
] }, { "cell_type": "code", "execution_count": 219, "metadata": {}, "outputs": [], "source": [ "def image_labels(bucket, key):\n", "    image_object = {'S3Object':{'Bucket': bucket,'Name': key}}\n", "\n", "    response = rekognition.detect_labels(Image=image_object)\n", "    for label in response['Labels']:\n", "        print('{} ({:.0f}%)'.format(label['Name'], label['Confidence']))" ] }, { "cell_type": "code", "execution_count": 220, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vehicle (100%)\n", "Automobile (100%)\n", "Transportation (100%)\n", "Car (100%)\n", "Human (99%)\n", "Person (99%)\n", "Truck (98%)\n", "Machine (96%)\n", "Wheel (96%)\n", "Clothing (87%)\n", "Apparel (87%)\n", "Footwear (87%)\n", "Shoe (87%)\n", "Road (75%)\n", "Traffic Jam (73%)\n", "City (73%)\n", "Urban (73%)\n", "Metropolis (73%)\n", "Building (73%)\n", "Town (73%)\n", "Cab (71%)\n", "Taxi (71%)\n", "Traffic Light (68%)\n", "Light (68%)\n", "Neighborhood (62%)\n", "People (62%)\n", "Pedestrian (59%)\n" ] } ], "source": [ "image_labels(image_bucket, 'images/traffic-in-manhattan.jpg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Image Label Count\n", "Now that we have a label detecting a traffic jam, along with some ingredients of a busy traffic location like pedestrians, trucks, and cars, let us derive quantitative data for benchmarking different traffic locations. If we can count the number of cars, trucks, and persons in an image, we can compare these numbers with those of other images. Our function does just that: it counts the number of instances of a matching label."
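, "\n", "Counting can also be written as a single expression over the response. Note that a substring test like ``match in label['Name']`` would let 'Car' also count labels such as 'Card'; an exact comparison avoids that. A minimal sketch over an illustrative response fragment (the sample values and the ``label_count`` name are for illustration only):\n", "\n", "```python\n", "# Fragment shaped like the 'Labels' list; each bounding box appears in 'Instances'\n", "sample_labels = [\n", "    {'Name': 'Car', 'Instances': [{}, {}, {}]},\n", "    {'Name': 'Cab', 'Instances': [{}]},\n", "    {'Name': 'Road', 'Instances': []},\n", "]\n", "\n", "def label_count(labels, match):\n", "    # Exact name match; sum the bounding-box instances per matching label\n", "    return sum(len(label['Instances']) for label in labels if label['Name'] == match)\n", "\n", "label_count(sample_labels, 'Car')  # 3\n", "```"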
] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [], "source": [ "def image_label_count(bucket, key, match):\n", "    image_object = {'S3Object':{'Bucket': bucket,'Name': key}}\n", "\n", "    response = rekognition.detect_labels(Image=image_object)\n", "    count = 0\n", "    for label in response['Labels']:\n", "        if match in label['Name']:\n", "            for instance in label['Instances']:\n", "                count += 1\n", "    print(f'Found {match} {count} times.')" ] }, { "cell_type": "code", "execution_count": 222, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found Car 9 times.\n" ] } ], "source": [ "image_label_count(image_bucket, 'images/traffic-in-manhattan.jpg', 'Car')" ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found Truck 4 times.\n" ] } ], "source": [ "image_label_count(image_bucket, 'images/traffic-in-manhattan.jpg', 'Truck')" ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found Person 8 times.\n" ] } ], "source": [ "image_label_count(image_bucket, 'images/traffic-in-manhattan.jpg', 'Person')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Image Text\n", "Another use case for traffic location analytics using social media content is to learn more about a traffic location, for instance whether there is a reported incident like an accident, a jam, or VIP movement. For a computer program to understand a random traffic location, it may help to capture any text within the image. 
The ``image_text`` function uses the Amazon Rekognition service to detect text in an image.\n", "\n", "You will notice that the text recognition is capable of reading blurry text like \"The Lion King\", text at a perspective angle like the bus route, text which may be overlooked by the human eye like the address below the shoes banner, and even the text representing the taxi number. Suddenly the image starts telling a story programmatically: what time period it may represent, what the landmarks are, which bus route and which taxi number were on the streets, and so on." ] }, { "cell_type": "code", "execution_count": 225, "metadata": {}, "outputs": [], "source": [ "def image_text(bucket, key, sort_column='', parents=True):\n", "    response = rekognition.detect_text(Image={'S3Object':{'Bucket':bucket,'Name': key}})\n", "    df = pd.read_json(io.StringIO(json.dumps(response['TextDetections'])))\n", "    df['Width'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Width'])\n", "    df['Height'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Height'])\n", "    df['Left'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Left'])\n", "    df['Top'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Top'])\n", "    df = df.drop(columns=['Geometry'])\n", "    if sort_column:\n", "        df = df.sort_values([sort_column])\n", "    if not parents:\n", "        df = df[df['ParentId'] > 0]\n", "    return df" ] }, { "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_image(image_bucket, 'images/nyc-taxi-signs.jpeg', 1024)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting on the ``Top`` column keeps horizontally aligned text together." ] }, { "cell_type": "code", "execution_count": 227, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Confidence DetectedText Id ParentId Type Width Height Left \\\n", "14 91.874588 WAY 15 1.0 WORD 0.028470 0.019385 0.599400 \n", "15 83.133957 6ASW 14 1.0 WORD 0.034089 0.018404 0.570143 \n", "17 94.518997 HAN'S 17 2.0 WORD 0.070971 0.032111 0.388597 \n", "16 99.643578 DELI 16 2.0 WORD 0.080892 0.041151 0.281320 \n", "18 90.439888 & 18 3.0 WORD 0.027007 0.044044 0.364591 \n", "19 99.936119 GROCERY 19 3.0 WORD 0.150999 0.042149 0.399850 \n", "20 81.925537 ZiGi 20 4.0 WORD 0.027007 0.023035 0.595649 \n", "21 95.180290 SHOES 21 5.0 WORD 0.041695 0.019078 0.621906 \n", "22 91.584435 X29CONEYSL 23 5.0 WORD 0.108448 0.038509 0.887472 \n", "23 90.353638 x29 24 5.0 WORD 0.038896 0.033245 0.888972 \n", "24 96.308746 647 22 5.0 WORD 0.018755 0.016016 0.747937 \n", "25 97.540222 BROADWAY 25 6.0 WORD 0.055210 0.018034 0.768192 \n", "26 89.723869 NEW 27 7.0 WORD 0.033758 0.019019 0.587397 \n", "27 92.452881 YORK 28 7.0 WORD 0.035273 0.020034 0.618905 \n", "29 92.044113 CITY 29 7.0 WORD 0.027007 0.016016 0.655664 \n", "28 95.421768 food 26 7.0 WORD 0.033758 0.024024 0.555889 \n", "33 96.425499 WINE 33 8.0 WORD 0.043511 0.022022 0.592648 \n", "31 87.556793 LIon 31 8.0 WORD 0.041260 0.030030 0.336084 \n", "32 90.025482 KING 32 8.0 WORD 0.045022 0.033042 0.377344 \n", "34 96.632484 FOOD 35 8.0 WORD 0.043522 0.021034 0.645911 \n", "30 98.496071 THE 30 8.0 WORD 0.031508 0.023023 0.303826 \n", "36 96.938141 FESTIVALS 34 8.0 WORD 0.090028 0.021034 0.596399 \n", "35 71.623650 EME 36 9.0 WORD 0.029257 0.027027 0.450113 \n", "37 88.608627 Oct.9-12 37 9.0 WORD 0.036773 0.016016 0.553638 \n", "38 91.010559 SALE 38 9.0 WORD 0.023788 0.018158 0.735934 \n", "39 80.209969 02 39 10.0 WORD 0.024027 0.021034 0.077269 \n", "40 85.682373 9214'', 40 11.0 WORD 0.112688 0.028068 0.762191 \n", "41 97.959709 TAXI 42 12.0 WORD 0.104583 0.052101 0.488372 \n", "42 96.415970 NYC 41 12.0 WORD 0.066138 0.036067 0.414104 \n", "\n", " Top \n", "14 0.109109 \n", "15 0.126126 \n", "17 
0.187187 \n", "16 0.201201 \n", "18 0.212212 \n", "19 0.217217 \n", "20 0.265265 \n", "21 0.269269 \n", "22 0.279279 \n", "23 0.282282 \n", "24 0.293293 \n", "25 0.295295 \n", "26 0.379379 \n", "27 0.382382 \n", "29 0.389389 \n", "28 0.392392 \n", "33 0.398398 \n", "31 0.400400 \n", "32 0.400400 \n", "34 0.402402 \n", "30 0.403403 \n", "36 0.419419 \n", "35 0.426426 \n", "37 0.437437 \n", "38 0.452452 \n", "39 0.488488 \n", "40 0.600601 \n", "41 0.716717 \n", "42 0.736737 " ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" } ], "source": [ "image_text(image_bucket, 'images/nyc-taxi-signs.jpeg', sort_column='Top', parents=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detect Celebs\n", "Traffic analytics may also involve detecting VIP movement to divert traffic or monitor security events. Detecting a VIP in a scene starts with facial recognition. Our ``detect_celebs`` function works as well with political figures as it does with movie celebrities."
] }, { "cell_type": "code", "execution_count": 228, "metadata": {}, "outputs": [], "source": [ "def detect_celebs(bucket, key, sort_column=''):\n", "    image_object = {'S3Object':{'Bucket': bucket,'Name': key}}\n", "\n", "    response = rekognition.recognize_celebrities(Image=image_object)\n", "    df = pd.DataFrame(response['CelebrityFaces'])\n", "    df['Width'] = df['Face'].apply(lambda x: x['BoundingBox']['Width'])\n", "    df['Height'] = df['Face'].apply(lambda x: x['BoundingBox']['Height'])\n", "    df['Left'] = df['Face'].apply(lambda x: x['BoundingBox']['Left'])\n", "    df['Top'] = df['Face'].apply(lambda x: x['BoundingBox']['Top'])\n", "    df = df.drop(columns=['Face'])\n", "    if sort_column:\n", "        df = df.sort_values([sort_column])\n", "    return df" ] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 229, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_image(image_bucket, 'images/world-leaders.jpg', 1024)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id MatchConfidence Name \\\n", "3 4Ev8IX1 100.0 Chulabhorn \n", "5 3J795K 100.0 Manmohan Singh \n", "25 f0JR5e 90.0 Mahinda Rajapaksa \n", "30 3n7tl2O 88.0 Killah Priest \n", "12 2gC0Tc0e 100.0 Rosen Plevneliev \n", "19 3LR2lb6j 56.0 Jerry Harrison \n", "1 4hD40O 100.0 Thomas Boni Yayi \n", "22 2F5LV4 63.0 Irwansyah \n", "8 3hk2qj5G 98.0 Cristina Fernández de Kirchner \n", "13 2sN1oC8s 100.0 Jorge Carlos Fonseca \n", "9 3Ns4kC2b 100.0 Sebastián Piñera \n", "15 1qy7Yt8D 100.0 Gurbanguly Berdimuhamedow \n", "4 1eA7EJ2W 63.0 Salim Durani \n", "20 2vr4uV3M 95.0 Albert II, Prince of Monaco \n", "29 4pv6OP8 90.0 Nick Clegg \n", "7 pL8KD9X 100.0 Denis Sassou Nguesso \n", "0 46JZ2c 97.0 Ban Ki-moon \n", "27 2yG8Fe4x 79.0 Mem Fox \n", "18 2nk8Bd0 58.0 Ali Bongo Ondimba \n", "2 2aE2DV3K 100.0 Susilo Bambang Yudhoyono \n", "17 3m4lC0 82.0 Uhuru Kenyatta \n", "28 K8hL4i 67.0 Erkki Tuomioja \n", "26 2KJ7KM8e 100.0 Isatou Njie-Saidy \n", "14 aU4fU4 100.0 Laura Chinchilla \n", "16 2DM2OT1F 91.0 Alpha Condé \n", "11 4eh5t9f 99.0 Helle Thorning-Schmidt \n", "21 Em8cA8q 70.0 Ollanta Humala \n", "24 4FT4On6a 94.0 Mariano Rajoy \n", "23 1oa5Af1 73.0 James Van Praagh \n", "10 47mP82 82.0 János Áder \n", "6 16BU2ey 99.0 José Manuel Barroso \n", "\n", " Urls Width Height Left Top \n", "3 [] 0.020202 0.038973 0.015152 0.424905 \n", "5 [] 0.018687 0.035171 0.131313 0.420152 \n", "25 [] 0.016162 0.030418 0.145960 0.319392 \n", "30 [www.imdb.com/name/nm0697334] 0.014646 0.027567 0.162121 0.290875 \n", "12 [] 0.018182 0.034221 0.179293 0.367871 \n", "19 [] 0.017172 0.032319 0.227273 0.330798 \n", "1 [] 0.021717 0.040875 0.236364 0.399240 \n", "22 [www.imdb.com/name/nm2679097] 0.016667 0.031369 0.274747 0.340304 \n", "8 [www.imdb.com/name/nm3231417] 0.018687 0.035171 0.278283 0.414449 \n", "13 [] 0.018182 0.034221 0.280808 0.370722 \n", "9 [] 0.018687 0.035171 0.318687 0.374525 \n", "15 [] 0.018182 0.034221 0.334848 0.317490 \n", "4 [] 0.019192 0.036122 0.418687 0.331749 
\n", "20 [] 0.017172 0.032319 0.463636 0.332700 \n", "29 [www.imdb.com/name/nm2200958] 0.015152 0.028517 0.465152 0.255703 \n", "7 [] 0.018687 0.035171 0.472727 0.368821 \n", "0 [www.imdb.com/name/nm2559634] 0.022727 0.042776 0.526768 0.402091 \n", "27 [] 0.015152 0.028517 0.607071 0.351711 \n", "18 [] 0.017172 0.032319 0.612121 0.381179 \n", "2 [www.imdb.com/name/nm2670444] 0.020707 0.038973 0.626263 0.403042 \n", "17 [www.imdb.com/name/nm6045979] 0.017172 0.032319 0.650505 0.343156 \n", "28 [] 0.015152 0.028517 0.657071 0.280418 \n", "26 [] 0.015657 0.029468 0.666162 0.396388 \n", "14 [] 0.018182 0.034221 0.679798 0.429658 \n", "16 [] 0.017677 0.033270 0.708586 0.369772 \n", "11 [www.imdb.com/name/nm1525284] 0.018182 0.034221 0.723232 0.399240 \n", "21 [] 0.017172 0.032319 0.766667 0.355513 \n", "24 [www.imdb.com/name/nm1775577] 0.016162 0.030418 0.786869 0.282319 \n", "23 [www.imdb.com/name/nm1070530] 0.016667 0.031369 0.806061 0.378327 \n", "10 [] 0.018182 0.034221 0.848485 0.365970 \n", "6 [] 0.018687 0.035171 0.960606 0.408745 " ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "detect_celebs(image_bucket, 'images/world-leaders.jpg', sort_column='Left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comprehend Syntax\n", "Many data sources contain natural language and free text. Understanding the structure and semantics of this unstructured text can further our open data analytics use cases.\n", "\n", "Let us assume we are processing traffic updates into structured data so we can take appropriate actions. The first step in understanding natural language is to break it up into grammatical parts of speech. Nouns like \"today\" can tell us about a particular event, such as when it is occurring. Adjectives like \"snowing\" and \"windy\" tell us what is happening at that moment in time. 
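\n\nOnce the tokens are extracted, grouping them by part-of-speech tag gives a quick structured view of the update. A minimal sketch over an illustrative fragment shaped like the ``SyntaxTokens`` response (the sample values are for illustration only):\n\n```python\nfrom collections import defaultdict\n\n# Fragment shaped like the 'SyntaxTokens' list in a detect_syntax response\nsample_tokens = [\n    {'Text': 'snowing', 'PartOfSpeech': {'Tag': 'ADJ'}},\n    {'Text': 'windy', 'PartOfSpeech': {'Tag': 'ADJ'}},\n    {'Text': 'today', 'PartOfSpeech': {'Tag': 'NOUN'}},\n]\n\nby_tag = defaultdict(list)\nfor token in sample_tokens:\n    by_tag[token['PartOfSpeech']['Tag']].append(token['Text'])\n\nby_tag['ADJ']  # ['snowing', 'windy']\n```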
" ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [], "source": [ "comprehend = boto3.client('comprehend', 'us-east-1')\n", "\n", "traffic_update = \"\"\"\n", "It is snowing and windy today in New York. The temperature is 50 degrees Fahrenheit. \n", "The traffic is slow 10 mph with several jams along the I-86.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [], "source": [ "def comprehend_syntax(text):\n", "    response = comprehend.detect_syntax(Text=text, LanguageCode='en')\n", "    df = pd.read_json(io.StringIO(json.dumps(response['SyntaxTokens'])))\n", "    df['Tag'] = df['PartOfSpeech'].apply(lambda x: x['Tag'])\n", "    df['Score'] = df['PartOfSpeech'].apply(lambda x: x['Score'])\n", "    df = df.drop(columns=['PartOfSpeech'])\n", "    return df" ] }, { "cell_type": "code", "execution_count": 233, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " BeginOffset EndOffset Text TokenId Tag Score\n", "0 1 3 It 1 PRON 0.999971\n", "1 4 6 is 2 VERB 0.557677\n", "2 7 14 snowing 3 ADJ 0.687805\n", "3 15 18 and 4 CONJ 0.999998\n", "4 19 24 windy 5 ADJ 0.994336\n", "5 25 30 today 6 NOUN 0.999980\n", "6 31 33 in 7 ADP 0.999924\n", "7 34 37 New 8 PROPN 0.999351\n", "8 38 42 York 9 PROPN 0.998399\n", "9 42 43 . 10 PUNCT 0.999998\n", "10 44 47 The 11 DET 0.999979\n", "11 48 59 temperature 12 NOUN 0.999760\n", "12 60 62 is 13 VERB 0.998011\n", "13 63 65 50 14 NUM 0.999716\n", "14 66 73 degrees 15 NOUN 0.999700\n", "15 74 84 Fahrenheit 16 PROPN 0.950743\n", "16 84 85 . 17 PUNCT 0.999994\n", "17 87 90 The 18 DET 0.999975\n", "18 91 98 traffic 19 NOUN 0.999450\n", "19 99 101 is 20 VERB 0.965014\n", "20 102 106 slow 21 ADJ 0.815718\n", "21 107 109 10 22 NUM 0.999991\n", "22 110 113 mph 23 NOUN 0.988531\n", "23 114 118 with 24 ADP 0.973397\n", "24 119 126 several 25 ADJ 0.999647\n", "25 127 131 jams 26 NOUN 0.999936\n", "26 132 137 along 27 ADP 0.997718\n", "27 138 141 the 28 DET 0.999960\n", "28 142 143 I 29 PROPN 0.745183\n", "29 143 144 - 30 PUNCT 0.999858\n", "30 144 146 86 31 PROPN 0.684016\n", "31 146 147 . 32 PUNCT 0.999985" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comprehend_syntax(traffic_update)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comprehend Entities\n", "More insights can be derived by extracting entities from the natural language. These entities can be dates, locations, and quantities, among others. Just a few entities can tell a structured story to a program."
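, "\n", "The extracted entities can then be folded into a simple structure. A minimal sketch over the entity values shown above (the ``entities_by_type`` name is illustrative):\n", "\n", "```python\n", "from collections import defaultdict\n", "\n", "# List shaped like the 'Entities' in a detect_entities response\n", "sample_entities = [\n", "    {'Text': 'today', 'Type': 'DATE'},\n", "    {'Text': 'New York', 'Type': 'LOCATION'},\n", "    {'Text': '50 degrees Fahrenheit', 'Type': 'QUANTITY'},\n", "    {'Text': 'I-86', 'Type': 'LOCATION'},\n", "]\n", "\n", "entities_by_type = defaultdict(list)\n", "for entity in sample_entities:\n", "    entities_by_type[entity['Type']].append(entity['Text'])\n", "\n", "entities_by_type['LOCATION']  # ['New York', 'I-86']\n", "```"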
] }, { "cell_type": "code", "execution_count": 234, "metadata": {}, "outputs": [], "source": [ "def comprehend_entities(text):\n", "    response = comprehend.detect_entities(Text=text, LanguageCode='en')\n", "    df = pd.read_json(io.StringIO(json.dumps(response['Entities'])))\n", "    return df" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " BeginOffset EndOffset Score Text Type\n", "0 25 30 0.839589 today DATE\n", "1 34 42 0.998423 New York LOCATION\n", "2 63 84 0.984396 50 degrees Fahrenheit QUANTITY\n", "3 107 113 0.992498 10 mph QUANTITY\n", "4 142 146 0.990993 I-86 LOCATION" ] }, "execution_count": 235, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comprehend_entities(traffic_update)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comprehend Phrases\n", "Analysis of phrases within narutal language text complements the other two methods for a program to better route the actions based on derived structure of the event." ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [], "source": [ "def comprehend_phrases(text):\n", " response = comprehend.detect_key_phrases(Text=text, LanguageCode='en')\n", " df = pd.read_json(io.StringIO(json.dumps(response['KeyPhrases'])))\n", " return df" ] }, { "cell_type": "code", "execution_count": 237, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BeginOffsetEndOffsetScoreText
025300.988285today
134420.997397New York
244590.999752The temperature
363730.78984350 degrees
487980.999843The traffic
51071130.92473710 mph
61191310.998428several jams
71381460.997108the I-86
\n", "
" ], "text/plain": [ " BeginOffset EndOffset Score Text\n", "0 25 30 0.988285 today\n", "1 34 42 0.997397 New York\n", "2 44 59 0.999752 The temperature\n", "3 63 73 0.789843 50 degrees\n", "4 87 98 0.999843 The traffic\n", "5 107 113 0.924737 10 mph\n", "6 119 131 0.998428 several jams\n", "7 138 146 0.997108 the I-86" ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comprehend_phrases(traffic_update)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comprehend Sentiment\n", "Sentiment analysis is common for social media user generated content. Sentiment can give us signals on the users' mood when publishing such social data." ] }, { "cell_type": "code", "execution_count": 238, "metadata": {}, "outputs": [], "source": [ "def comprehend_sentiment(text):\n", " response = comprehend.detect_sentiment(Text=text, LanguageCode='en')\n", " return response['SentimentScore']" ] }, { "cell_type": "code", "execution_count": 239, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Positive': 0.04090394824743271,\n", " 'Negative': 0.3745909333229065,\n", " 'Neutral': 0.5641733407974243,\n", " 'Mixed': 0.020331736654043198}" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "comprehend_sentiment(traffic_update)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change Log\n", "\n", "This section captures changes and updates to this notebook across releases.\n", "\n", "#### Usability and sorting for text and face detection - Release 3 MAY 2019\n", "Functions ``image_text`` and ``detect_celeb`` can now sort results based on a column name. 
Function ``image_text`` can optionally show results without parent-child relations.\n", "\n", "Usability update for the ``comprehend_syntax`` function: the ``part of speech`` dictionary value is now split into separate Tag and Score columns.\n", "\n", "#### Launch - Release 30 APR 2019\n", "This is the launch release which builds the AWS Open Data Analytics API for using AWS AI services to analyze public data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "#### Using AI Services for Analyzing Public Data\n", "by Manav Sehgal | on APR 30 2019\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }