## Using AI Services for Analyzing Public Data

by Manav Sehgal | on APR 30 2019

So far we have been working with structured data in flat files as our data source. What if the source is images and unstructured text? AWS AI services provide vision, transcription, translation, personalization, and forecasting capabilities without the need for training and deploying machine learning models. AWS manages the machine learning complexity; you focus on the problem at hand, send the required inputs for analysis, and receive output from these services within your applications.

Extending our open data analytics use case to New York traffic, let us use the AWS AI services to turn open data available in social media, Wikipedia, and other sources into structured datasets and insights. We will start by importing dependencies for the AWS SDK, Python data frames, file operations, handling JSON data, and display formatting. We will initialize the Rekognition client for use in the rest of this notebook.

```python
import boto3
import pandas as pd
import io
import json
from IPython.display import display, Markdown, Image

rekognition = boto3.client('rekognition', 'us-east-1')
image_bucket = 'open-data-analytics-taxi-trips'
```

### Show Image

We will work with a number of images, so we need a way to show these images within this notebook. Our function creates a public image URL based on the S3 bucket and key given as input.

```python
def show_image(bucket, key, img_width = 500):
    # [TODO] Load non-public images
    return Image(url='https://s3.amazonaws.com/' + bucket + '/' + key, width=img_width)
```

```python
show_image(image_bucket, 'images/traffic-in-manhattan.jpg', 1024)
```
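The `[TODO]` in `show_image` notes that only public images load this way. One possible approach for private objects, sketched here under assumptions, is an S3 presigned URL. The helper name `private_image_url` and its `expires_in` default are hypothetical and not part of the original notebook; `generate_presigned_url` is the boto3 S3 client method, and the caller's credentials are assumed to have read access to the object.

```python
def private_image_url(s3_client, bucket, key, expires_in=3600):
    """Return a time-limited URL for a private S3 object.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    expires_in is the URL lifetime in seconds (hypothetical default).
    """
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires_in,
    )
```

The returned URL could then be passed to `Image(url=..., width=img_width)` in place of the public URL that `show_image` builds.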

### Image Labels

One of the use cases for traffic analytics is processing traffic CCTV imagery or social media uploads. Let's consider a traffic location where, depending on the number of cars, trucks, and pedestrians, we can identify if there is a traffic jam. This insight can be used to better manage the flow of traffic around the location and plan ahead for future use of this route.

The first step in this kind of analytics is to recognize that we are actually looking at an image which may represent a traffic jam. We create the ``image_labels`` function, which uses the ``detect_labels`` Rekognition API to detect objects within an image. The function prints each detected label with its confidence score. In the given example, notice that somewhere in the middle of the labels listing, at 73% confidence, the Rekognition computer vision model has actually determined a traffic jam.

```python
def image_labels(bucket, key):
    image_object = {'S3Object': {'Bucket': bucket, 'Name': key}}
    response = rekognition.detect_labels(Image=image_object)
    # Print each label with its confidence rounded to a whole percent
    for label in response['Labels']:
        print('{} ({:.0f}%)'.format(label['Name'], label['Confidence']))
```

```python
image_labels(image_bucket, 'images/traffic-in-manhattan.jpg')
```

    Vehicle (100%)
    Automobile (100%)
    Transportation (100%)
    Car (100%)
    Human (99%)
    Person (99%)
    Truck (98%)
    Machine (96%)
    Wheel (96%)
    Clothing (87%)
    Apparel (87%)
    Footwear (87%)
    Shoe (87%)
    Road (75%)
    Traffic Jam (73%)
    City (73%)
    Urban (73%)
    Metropolis (73%)
    Building (73%)
    Town (73%)
    Cab (71%)
    Taxi (71%)
    Traffic Light (68%)
    Light (68%)
    Neighborhood (62%)
    People (62%)
    Pedestrian (59%)

### Image Label Count

Now that we have a label detecting a traffic jam and some of the ingredients of a busy traffic location, like pedestrians, trucks, and cars, let us determine quantitative data for benchmarking different traffic locations. If we can count the number of cars, trucks, and persons in the image, we can compare these numbers with other images.
Our function does just that: it counts the number of instances of a matching label.

```python
def image_label_count(bucket, key, match):
    image_object = {'S3Object': {'Bucket': bucket, 'Name': key}}
    response = rekognition.detect_labels(Image=image_object)
    count = 0
    for label in response['Labels']:
        # Substring match on the label name
        if match in label['Name']:
            # Each entry in Instances is one detected occurrence of the label
            for instance in label['Instances']:
                count += 1
    print(f'Found {match} {count} times.')
```

```python
image_label_count(image_bucket, 'images/traffic-in-manhattan.jpg', 'Car')
```

    Found Car 9 times.

```python
image_label_count(image_bucket, 'images/traffic-in-manhattan.jpg', 'Truck')
```

    Found Truck 4 times.

```python
image_label_count(image_bucket, 'images/traffic-in-manhattan.jpg', 'Person')
```

    Found Person 8 times.

### Image Text

Another use case of traffic location analytics using social media content is to understand more about a traffic location, for instance whether an incident has been reported, like an accident, jam, or VIP movement. For a computer program to understand a random traffic location, it may help to capture any text within the image. The ``image_text`` function uses the Amazon Rekognition service to detect text in an image. You will notice that the text recognition is capable of reading blurry text like "The Lion King", text which is at a perspective like the bus route, text which may be ignored by the human eye like the address below the shoes banner, and even the text representing the taxi number. Suddenly the image starts telling a story programmatically: what time it may represent, what the landmarks are, which bus route, which taxi number was on the streets, and so on.
```python
def image_text(bucket, key, sort_column='', parents=True):
    response = rekognition.detect_text(Image={'S3Object': {'Bucket': bucket, 'Name': key}})
    df = pd.read_json(io.StringIO(json.dumps(response['TextDetections'])))
    # Flatten the nested bounding box into separate columns
    df['Width'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Width'])
    df['Height'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Height'])
    df['Left'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Left'])
    df['Top'] = df['Geometry'].apply(lambda x: x['BoundingBox']['Top'])
    df = df.drop(columns=['Geometry'])
    if sort_column:
        df = df.sort_values([sort_column])
    if not parents:
        # Keep only rows with a positive ParentId, which removes rows without a parent.
        # Note: this comparison also drops any word whose ParentId is 0.
        df = df[df['ParentId'] > 0]
    return df
```

```python
show_image(image_bucket, 'images/nyc-taxi-signs.jpeg', 1024)
```

Sorting on the ``Top`` column will keep the horizontal text together.

```python
image_text(image_bucket, 'images/nyc-taxi-signs.jpeg', sort_column='Top', parents=False)
```

	Confidence	DetectedText	Id	ParentId	Type	Width	Height	Left	Top
14	91.874588	WAY	15	1.0	WORD	0.028470	0.019385	0.599400	0.109109
15	83.133957	6ASW	14	1.0	WORD	0.034089	0.018404	0.570143	0.126126
17	94.518997	HAN'S	17	2.0	WORD	0.070971	0.032111	0.388597	0.187187
16	99.643578	DELI	16	2.0	WORD	0.080892	0.041151	0.281320	0.201201
18	90.439888	&	18	3.0	WORD	0.027007	0.044044	0.364591	0.212212
19	99.936119	GROCERY	19	3.0	WORD	0.150999	0.042149	0.399850	0.217217
20	81.925537	ZiGi	20	4.0	WORD	0.027007	0.023035	0.595649	0.265265
21	95.180290	SHOES	21	5.0	WORD	0.041695	0.019078	0.621906	0.269269
22	91.584435	X29CONEYSL	23	5.0	WORD	0.108448	0.038509	0.887472	0.279279
23	90.353638	x29	24	5.0	WORD	0.038896	0.033245	0.888972	0.282282
24	96.308746	647	22	5.0	WORD	0.018755	0.016016	0.747937	0.293293
25	97.540222	BROADWAY	25	6.0	WORD	0.055210	0.018034	0.768192	0.295295
26	89.723869	NEW	27	7.0	WORD	0.033758	0.019019	0.587397	0.379379
27	92.452881	YORK	28	7.0	WORD	0.035273	0.020034	0.618905	0.382382
29	92.044113	CITY	29	7.0	WORD	0.027007	0.016016	0.655664	0.389389
28	95.421768	food	26	7.0	WORD	0.033758	0.024024	0.555889	0.392392
33	96.425499	WINE	33	8.0	WORD	0.043511	0.022022	0.592648	0.398398
31	87.556793	LIon	31	8.0	WORD	0.041260	0.030030	0.336084	0.400400
32	90.025482	KING	32	8.0	WORD	0.045022	0.033042	0.377344	0.400400
34	96.632484	FOOD	35	8.0	WORD	0.043522	0.021034	0.645911	0.402402
30	98.496071	THE	30	8.0	WORD	0.031508	0.023023	0.303826	0.403403
36	96.938141	FESTIVALS	34	8.0	WORD	0.090028	0.021034	0.596399	0.419419
35	71.623650	EME	36	9.0	WORD	0.029257	0.027027	0.450113	0.426426
37	88.608627	Oct.9-12	37	9.0	WORD	0.036773	0.016016	0.553638	0.437437
38	91.010559	SALE	38	9.0	WORD	0.023788	0.018158	0.735934	0.452452
39	80.209969	02	39	10.0	WORD	0.024027	0.021034	0.077269	0.488488
40	85.682373	9214'',	40	11.0	WORD	0.112688	0.028068	0.762191	0.600601
41	97.959709	TAXI	42	12.0	WORD	0.104583	0.052101	0.488372	0.716717
42	96.415970	NYC	41	12.0	WORD	0.066138	0.036067	0.414104	0.736737
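The table above lists individual words, each carrying the `ParentId` of the line it belongs to. As a minimal sketch of how those rows could be turned back into readable lines, the snippet below groups words by `ParentId` and orders them by `Left`. The helper name `words_by_line` is hypothetical, and the sample rows are a few entries copied from the table with only the fields used here.

```python
from collections import defaultdict

# A few WORD rows copied from the table above (only the fields used here)
sample_words = [
    {'DetectedText': 'NEW', 'ParentId': 7, 'Left': 0.587397},
    {'DetectedText': 'YORK', 'ParentId': 7, 'Left': 0.618905},
    {'DetectedText': 'CITY', 'ParentId': 7, 'Left': 0.655664},
    {'DetectedText': 'TAXI', 'ParentId': 12, 'Left': 0.488372},
    {'DetectedText': 'NYC', 'ParentId': 12, 'Left': 0.414104},
]

def words_by_line(words):
    """Group WORD detections by ParentId and join each group left to right."""
    lines = defaultdict(list)
    for word in words:
        lines[word['ParentId']].append(word)
    return {
        parent_id: ' '.join(
            w['DetectedText'] for w in sorted(group, key=lambda w: w['Left'])
        )
        for parent_id, group in lines.items()
    }

print(words_by_line(sample_words))
```

With a DataFrame returned by ``image_text``, the same function could be fed `df.to_dict('records')`.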

### Detect Celebs

Traffic analytics may also involve detecting VIP movement to divert traffic or monitor security events. Detecting a VIP in a scene starts with facial recognition. Our function ``detect_celebs`` works as well with political figures as it does with movie celebrities.

```python
def detect_celebs(bucket, key, sort_column=''):
    image_object = {'S3Object': {'Bucket': bucket, 'Name': key}}
    response = rekognition.recognize_celebrities(Image=image_object)
    df = pd.DataFrame(response['CelebrityFaces'])
    # Flatten the nested face bounding box into separate columns
    df['Width'] = df['Face'].apply(lambda x: x['BoundingBox']['Width'])
    df['Height'] = df['Face'].apply(lambda x: x['BoundingBox']['Height'])
    df['Left'] = df['Face'].apply(lambda x: x['BoundingBox']['Left'])
    df['Top'] = df['Face'].apply(lambda x: x['BoundingBox']['Top'])
    df = df.drop(columns=['Face'])
    if sort_column:
        df = df.sort_values([sort_column])
    return df
```

```python
show_image(image_bucket, 'images/world-leaders.jpg', 1024)
```

```python
detect_celebs(image_bucket, 'images/world-leaders.jpg', sort_column='Left')
```

	Id	MatchConfidence	Name	Urls	Width	Height	Left	Top
3	4Ev8IX1	100.0	Chulabhorn	[]	0.020202	0.038973	0.015152	0.424905
5	3J795K	100.0	Manmohan Singh	[]	0.018687	0.035171	0.131313	0.420152
25	f0JR5e	90.0	Mahinda Rajapaksa	[]	0.016162	0.030418	0.145960	0.319392
30	3n7tl2O	88.0	Killah Priest	[www.imdb.com/name/nm0697334]	0.014646	0.027567	0.162121	0.290875
12	2gC0Tc0e	100.0	Rosen Plevneliev	[]	0.018182	0.034221	0.179293	0.367871
19	3LR2lb6j	56.0	Jerry Harrison	[]	0.017172	0.032319	0.227273	0.330798
1	4hD40O	100.0	Thomas Boni Yayi	[]	0.021717	0.040875	0.236364	0.399240
22	2F5LV4	63.0	Irwansyah	[www.imdb.com/name/nm2679097]	0.016667	0.031369	0.274747	0.340304
8	3hk2qj5G	98.0	Cristina Fernández de Kirchner	[www.imdb.com/name/nm3231417]	0.018687	0.035171	0.278283	0.414449
13	2sN1oC8s	100.0	Jorge Carlos Fonseca	[]	0.018182	0.034221	0.280808	0.370722
9	3Ns4kC2b	100.0	Sebastián Piñera	[]	0.018687	0.035171	0.318687	0.374525
15	1qy7Yt8D	100.0	Gurbanguly Berdimuhamedow	[]	0.018182	0.034221	0.334848	0.317490
4	1eA7EJ2W	63.0	Salim Durani	[]	0.019192	0.036122	0.418687	0.331749
20	2vr4uV3M	95.0	Albert II, Prince of Monaco	[]	0.017172	0.032319	0.463636	0.332700
29	4pv6OP8	90.0	Nick Clegg	[www.imdb.com/name/nm2200958]	0.015152	0.028517	0.465152	0.255703
7	pL8KD9X	100.0	Denis Sassou Nguesso	[]	0.018687	0.035171	0.472727	0.368821
0	46JZ2c	97.0	Ban Ki-moon	[www.imdb.com/name/nm2559634]	0.022727	0.042776	0.526768	0.402091
27	2yG8Fe4x	79.0	Mem Fox	[]	0.015152	0.028517	0.607071	0.351711
18	2nk8Bd0	58.0	Ali Bongo Ondimba	[]	0.017172	0.032319	0.612121	0.381179
2	2aE2DV3K	100.0	Susilo Bambang Yudhoyono	[www.imdb.com/name/nm2670444]	0.020707	0.038973	0.626263	0.403042
17	3m4lC0	82.0	Uhuru Kenyatta	[www.imdb.com/name/nm6045979]	0.017172	0.032319	0.650505	0.343156
28	K8hL4i	67.0	Erkki Tuomioja	[]	0.015152	0.028517	0.657071	0.280418
26	2KJ7KM8e	100.0	Isatou Njie-Saidy	[]	0.015657	0.029468	0.666162	0.396388
14	aU4fU4	100.0	Laura Chinchilla	[]	0.018182	0.034221	0.679798	0.429658
16	2DM2OT1F	91.0	Alpha Condé	[]	0.017677	0.033270	0.708586	0.369772
11	4eh5t9f	99.0	Helle Thorning-Schmidt	[www.imdb.com/name/nm1525284]	0.018182	0.034221	0.723232	0.399240
21	Em8cA8q	70.0	Ollanta Humala	[]	0.017172	0.032319	0.766667	0.355513
24	4FT4On6a	94.0	Mariano Rajoy	[www.imdb.com/name/nm1775577]	0.016162	0.030418	0.786869	0.282319
23	1oa5Af1	73.0	James Van Praagh	[www.imdb.com/name/nm1070530]	0.016667	0.031369	0.806061	0.378327
10	47mP82	82.0	János Áder	[]	0.018182	0.034221	0.848485	0.365970
6	16BU2ey	99.0	José Manuel Barroso	[]	0.018687	0.035171	0.960606	0.408745
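The table above mixes 100% matches with matches as low as 56%, so a program acting on these results would likely want a cutoff. The sketch below filters by `MatchConfidence`; the helper name `confident_matches` and the threshold of 90 are assumptions chosen for illustration, and the sample rows are copied from the table.

```python
# A few rows copied from the table above (only the fields used here)
sample_faces = [
    {'Name': 'Ban Ki-moon', 'MatchConfidence': 97.0},
    {'Name': 'Jerry Harrison', 'MatchConfidence': 56.0},
    {'Name': 'Nick Clegg', 'MatchConfidence': 90.0},
    {'Name': 'Mem Fox', 'MatchConfidence': 79.0},
]

def confident_matches(faces, min_confidence=90.0):
    """Return names whose match confidence meets the (assumed) threshold."""
    return [f['Name'] for f in faces if f['MatchConfidence'] >= min_confidence]

print(confident_matches(sample_faces))
```

On the DataFrame returned by ``detect_celebs``, the equivalent filter would be `df[df['MatchConfidence'] >= 90]`.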

### Comprehend Syntax

It is possible that many data sources represent natural language and free text. Understanding structure and semantics from this unstructured text can help further our open data analytics use cases. Let us assume we are processing traffic updates into structured data so we can take appropriate actions. The first step in understanding natural language is to break it up into its grammatical syntax. Nouns like "today" can tell us about a particular event, such as when the event is occurring. Adjectives like "snowing" and "windy" tell us what is happening at that moment in time.

```python
comprehend = boto3.client('comprehend', 'us-east-1')

traffic_update = """
It is snowing and windy today in New York. The temperature is 50 degrees Fahrenheit.
The traffic is slow 10 mph with several jams along the I-86.
"""
```

```python
def comprehend_syntax(text):
    response = comprehend.detect_syntax(Text=text, LanguageCode='en')
    df = pd.read_json(io.StringIO(json.dumps(response['SyntaxTokens'])))
    # Split the part-of-speech dictionary into separate Tag and Score columns
    df['Tag'] = df['PartOfSpeech'].apply(lambda x: x['Tag'])
    df['Score'] = df['PartOfSpeech'].apply(lambda x: x['Score'])
    df = df.drop(columns=['PartOfSpeech'])
    return df
```

```python
comprehend_syntax(traffic_update)
```

	BeginOffset	EndOffset	Text	TokenId	Tag	Score
0	1	3	It	1	PRON	0.999971
1	4	6	is	2	VERB	0.557677
2	7	14	snowing	3	ADJ	0.687805
3	15	18	and	4	CONJ	0.999998
4	19	24	windy	5	ADJ	0.994336
5	25	30	today	6	NOUN	0.999980
6	31	33	in	7	ADP	0.999924
7	34	37	New	8	PROPN	0.999351
8	38	42	York	9	PROPN	0.998399
9	42	43	.	10	PUNCT	0.999998
10	44	47	The	11	DET	0.999979
11	48	59	temperature	12	NOUN	0.999760
12	60	62	is	13	VERB	0.998011
13	63	65	50	14	NUM	0.999716
14	66	73	degrees	15	NOUN	0.999700
15	74	84	Fahrenheit	16	PROPN	0.950743
16	84	85	.	17	PUNCT	0.999994
17	87	90	The	18	DET	0.999975
18	91	98	traffic	19	NOUN	0.999450
19	99	101	is	20	VERB	0.965014
20	102	106	slow	21	ADJ	0.815718
21	107	109	10	22	NUM	0.999991
22	110	113	mph	23	NOUN	0.988531
23	114	118	with	24	ADP	0.973397
24	119	126	several	25	ADJ	0.999647
25	127	131	jams	26	NOUN	0.999936
26	132	137	along	27	ADP	0.997718
27	138	141	the	28	DET	0.999960
28	142	143	I	29	PROPN	0.745183
29	143	144	-	30	PUNCT	0.999858
30	144	146	86	31	PROPN	0.684016
31	146	147	.	32	PUNCT	0.999985
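To act on the tags in the table above, a program could pull out the tokens for one part of speech and ignore low-scoring ones. This is a minimal sketch under assumptions: the helper name `tokens_with_tag` and the 0.8 score cutoff are hypothetical, and the sample rows are copied from the table.

```python
# A few rows copied from the table above (only the fields used here)
sample_tokens = [
    {'Text': 'snowing', 'Tag': 'ADJ', 'Score': 0.687805},
    {'Text': 'windy', 'Tag': 'ADJ', 'Score': 0.994336},
    {'Text': 'today', 'Tag': 'NOUN', 'Score': 0.999980},
    {'Text': 'slow', 'Tag': 'ADJ', 'Score': 0.815718},
    {'Text': 'jams', 'Tag': 'NOUN', 'Score': 0.999936},
]

def tokens_with_tag(tokens, tag, min_score=0.0):
    """Return token texts that carry the given tag at or above min_score."""
    return [
        t['Text'] for t in tokens
        if t['Tag'] == tag and t['Score'] >= min_score
    ]

print(tokens_with_tag(sample_tokens, 'ADJ'))
print(tokens_with_tag(sample_tokens, 'ADJ', min_score=0.8))
```

Raising `min_score` drops "snowing", which the model tagged as an adjective with a comparatively low score.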

### Comprehend Entities

More insights can be derived by doing entity extraction from the natural language. These entities can be a date, location, or quantity, among others. Just a few of the entities can tell a structured story to a program.

```python
def comprehend_entities(text):
    response = comprehend.detect_entities(Text=text, LanguageCode='en')
    df = pd.read_json(io.StringIO(json.dumps(response['Entities'])))
    return df
```

```python
comprehend_entities(traffic_update)
```

	BeginOffset	EndOffset	Score	Text	Type
0	25	30	0.839589	today	DATE
1	34	42	0.998423	New York	LOCATION
2	63	84	0.984396	50 degrees Fahrenheit	QUANTITY
3	107	113	0.992498	10 mph	QUANTITY
4	142	146	0.990993	I-86	LOCATION
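One way the entities in the table above could become a structured record is to group them by `Type`. The sketch below does this in plain Python; the helper name `entities_by_type` is hypothetical, and the sample rows are copied from the table with only the fields used here.

```python
# Rows copied from the table above (only the fields used here)
sample_entities = [
    {'Text': 'today', 'Type': 'DATE'},
    {'Text': 'New York', 'Type': 'LOCATION'},
    {'Text': '50 degrees Fahrenheit', 'Type': 'QUANTITY'},
    {'Text': '10 mph', 'Type': 'QUANTITY'},
    {'Text': 'I-86', 'Type': 'LOCATION'},
]

def entities_by_type(entities):
    """Collect entity texts under their entity type, preserving order."""
    grouped = {}
    for entity in entities:
        grouped.setdefault(entity['Type'], []).append(entity['Text'])
    return grouped

print(entities_by_type(sample_entities))
```

A program could then read `grouped['LOCATION']` to learn where the update applies and `grouped['QUANTITY']` for the reported measurements.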

### Comprehend Phrases

Analysis of phrases within natural language text complements the other two methods, helping a program better route actions based on the derived structure of the event.

```python
def comprehend_phrases(text):
    response = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
    df = pd.read_json(io.StringIO(json.dumps(response['KeyPhrases'])))
    return df
```

```python
comprehend_phrases(traffic_update)
```

	BeginOffset	EndOffset	Score	Text
0	25	30	0.988285	today
1	34	42	0.997397	New York
2	44	59	0.999752	The temperature
3	63	73	0.789843	50 degrees
4	87	98	0.999843	The traffic
5	107	113	0.924737	10 mph
6	119	131	0.998428	several jams
7	138	146	0.997108	the I-86

### Comprehend Sentiment

Sentiment analysis is common for user-generated social media content. Sentiment can give us signals about the users' mood when publishing such social data.

```python
def comprehend_sentiment(text):
    response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
    return response['SentimentScore']
```

```python
comprehend_sentiment(traffic_update)
```

    {'Positive': 0.04090394824743271,
     'Negative': 0.3745909333229065,
     'Neutral': 0.5641733407974243,
     'Mixed': 0.020331736654043198}

### Change Log

This section captures changes and updates to this notebook across releases.

#### Usability and sorting for text and face detection - Release 3 MAY 2019

Functions ``image_text`` and ``detect_celebs`` can now sort results based on a column name. Function ``image_text`` can optionally show results without parent-child relations. Usability update for the ``comprehend_syntax`` function to split the ``part of speech`` dictionary value into separate Tag and Score columns.

#### Launch - Release 30 APR 2019

This is the launch release which builds the AWS Open Data Analytics API for using AWS AI services to analyze public data.