# Amazon Comprehend Events Finance Tutorial

This notebook is the companion to the Amazon Machine Learning blog post "[Announcing the launch of Amazon Comprehend Events](http://TBD)." It walks through submitting documents to the Comprehend Events asynchronous API, interpreting the predictions the service returns, and transforming and visualizing the data for analysis.

## Setup

In [1]:
%pip install -r requirements.txt > /dev/null

Note: you may need to restart the kernel to use updated packages.


In [2]:
%matplotlib inline

import json
import requests
import uuid

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import boto3
import smart_open

from time import sleep
from matplotlib import cm, colors
from spacy import displacy
from collections import Counter
from pyvis.network import Network

### Write documents to S3

We've included a set of Amazon press releases as example documents. Here we upload them as a single file `sample_finance_dataset.txt` to an S3 bucket for processing. The same bucket will be used to return service output.
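
The job submitted later in this notebook uses the `ONE_DOC_PER_LINE` input format, so each document has to occupy exactly one line of the uploaded file. The sample file is already in that shape. For your own documents, a minimal sketch of the preparation step might look like the following; the `to_one_doc_per_line` helper and the two sample strings are illustrative assumptions, not part of the original notebook.

```python
def to_one_doc_per_line(documents):
    """Collapse internal whitespace so each document fits on one line."""
    # str.split() with no arguments splits on any run of whitespace
    # (including newlines and tabs) and drops leading/trailing whitespace.
    lines = [" ".join(doc.split()) for doc in documents]
    # Skip documents that are empty after normalization.
    return "\n".join(line for line in lines if line) + "\n"


# Hypothetical two-document corpus with embedded line breaks.
docs = [
    "Amazon announced\nfinancial results.",
    "Whole Foods Market\n\nwill continue to operate stores.",
]
payload = to_one_doc_per_line(docs)
print(payload)
```

The resulting string can be written to a local file and uploaded with the same `upload_file` call used below.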

In [3]:
# Client and session information
session = boto3.Session()
s3_client = session.client(service_name="s3")

# Constants for S3 bucket and input data file
bucket = "comprehend-events-blogpost-us-east-1"
filename = "sample_finance_dataset.txt"
input_data_s3_path = f's3://{bucket}/' + filename
output_data_s3_path = f's3://{bucket}/'

# Upload the local file to S3
s3_client.upload_file("../data/" + filename, bucket, filename)

# Load the documents locally for later analysis
with open("../data/" + filename, "r") as fi:
    raw_texts = [line.strip() for line in fi.readlines()]

### Start an asynchronous job with the SDK

The first task is to kick off the inference job. We'll do this with the `start_events_detection_job` endpoint. Note that the API requires an IAM role with List, Read, and Write access to the bucket specified above.
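
The exact permissions depend on your account setup. As a rough sketch, and assuming the bucket name used in this notebook, an S3 policy covering List, Read, and Write access might be assembled as shown here. The mapping of those three access levels to `s3:ListBucket`, `s3:GetObject`, and `s3:PutObject` is an assumption of this sketch rather than something the tutorial specifies.

```python
import json

# Bucket name taken from the constants in this notebook; substitute your own.
bucket = "comprehend-events-blogpost-us-east-1"

# Assumed minimal policy: list the bucket, read inputs, write outputs.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Listing applies to the bucket itself.
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        },
        {
            # Reading and writing apply to the objects inside the bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        },
    ],
}
print(json.dumps(policy, indent=2))
```

The role also needs a trust relationship that allows the Comprehend service to assume it; that part is not shown here.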

In [4]:
# Comprehend client information
comprehend_client = session.client(service_name="comprehend")

# IAM role with access to Comprehend and specified S3 buckets
job_data_access_role = 'arn:aws:iam::xxxxxxxxxxxxx:role/service-role/AmazonComprehendServiceRole-test-events-role'

# Other job parameters
input_data_format = 'ONE_DOC_PER_LINE'
job_uuid = uuid.uuid1()
job_name = f"events-job-{job_uuid}"
event_types = ["BANKRUPTCY", "EMPLOYMENT", "CORPORATE_ACQUISITION", 
               "INVESTMENT_GENERAL", "CORPORATE_MERGER", "IPO",
               "RIGHTS_ISSUE", "SECONDARY_OFFERING", "SHELF_OFFERING",
               "TENDER_OFFERING", "STOCK_SPLIT"]

In [5]:
# Begin the inference job
response = comprehend_client.start_events_detection_job(
    InputDataConfig={'S3Uri': input_data_s3_path,
                     'InputFormat': input_data_format},
    OutputDataConfig={'S3Uri': output_data_s3_path},
    DataAccessRoleArn=job_data_access_role,
    JobName=job_name,
    LanguageCode='en',
    TargetEventTypes=event_types
)

# Get the job ID
events_job_id = response['JobId']

### Collect the results from S3

We poll the service with the `describe_events_detection_job` endpoint. Note that, as an asynchronous inference job, the task will take several minutes to complete. 

In [6]:
# Get current job status
job = comprehend_client.describe_events_detection_job(JobId=events_job_id)

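# Loop until the job completes, fails, or times out
waited = 0
timeout_minutes = 30
while job['EventsDetectionJobProperties']['JobStatus'] != 'COMPLETED':
    status = job['EventsDetectionJobProperties']['JobStatus']
    # Stop polling early if the job can no longer complete
    if status in ('FAILED', 'STOPPED'):
        raise RuntimeError("Job ended with status %s." % status)
    sleep(60)
    waited += 60
    if waited // 60 >= timeout_minutes:
        raise TimeoutError("Job timed out after %d seconds." % waited)
    job = comprehend_client.describe_events_detection_job(JobId=events_job_id)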

In [7]:
# The output filename is the input filename + ".out"
output_data_s3_file = job['EventsDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'

# Load the output into a list of result dictionaries, one per document
results = []
with smart_open.open(output_data_s3_file) as fi:
    results.extend([json.loads(line) for line in fi if line.strip()])

## Analyzing Comprehend Events output

The remainder of this notebook provides examples of different ways to analyze a given document. For our example document, we'll use the kind of online posting that a financial analyst might consume when projecting market trends: a [2017 press release about Amazon's acquisition of Whole Foods Market, Inc.](https://press.aboutamazon.com/news-releases/news-release-details/amazoncom-announces-third-quarter-sales-34-437-billion). It's the first document in the data set we submitted to the Comprehend Events API.

> Amazon.com, Inc. (NASDAQ: AMZN) today announced financial results for its third quarter ended September 30, 2017.

> Operating cash flow increased 14% to \\$17.1 billion for the trailing twelve months, compared with \\$15.0 billion for the trailing twelve months ended September 30, 2016. Free cash flow decreased to \\$8.1 billion for the trailing twelve months, compared with \\$9.0 billion for the trailing twelve months ended September 30, 2016. Free cash flow less lease principal repayments decreased to \\$3.5 billion for the trailing twelve months, compared with \\$5.3 billion for the trailing twelve months ended September 30, 2016. Free cash flow less finance lease principal repayments and assets acquired under capital leases decreased to an outflow of \\$1.0 billion for the trailing twelve months, compared with an inflow of \\$3.8 billion for the trailing twelve months ended September 30, 2016.

> Common shares outstanding plus shares underlying stock-based awards totaled 503 million on September 30, 2017, compared with 496 million one year ago.

> Net sales increased 34% to \\$43.7 billion in the third quarter, compared with \\$32.7 billion in third quarter 2016. Net sales includes \\$1.3 billion from Whole Foods Market, which Amazon acquired on August 28, 2017. Excluding Whole Foods Market and the \\$124 million favorable impact from year-over-year changes in foreign exchange rates throughout the quarter, net sales increased 29% compared with third quarter 2016.

> Operating income decreased 40% to \\$347 million in the third quarter, compared with operating income of \\$575 million in third quarter 2016. Operating income includes income of \\$21 million from Whole Foods Market.

> Net income was \\$256 million in the third quarter, or \\$0.52 per diluted share, compared with net income of \\$252 million, or \\$0.52 per diluted share, in third quarter 2016.

> “In the last month alone, we’ve launched five new Alexa-enabled devices, introduced Alexa in India, announced integration with BMW, surpassed 25,000 skills, integrated Alexa with Sonos speakers, taught Alexa to distinguish between two voices, and more. Because Alexa’s brain is in the AWS cloud, her new abilities are available to all Echo customers, not just those who buy a new device,” said Jeff Bezos, Amazon founder and CEO. “And it’s working — customers have purchased tens of millions of Alexa-enabled devices, given Echo devices over 100,000 5-star reviews, and active customers are up more than 5x since the same time last year. With thousands of developers and hardware makers building new Alexa skills and devices, the Alexa experience will continue to get even better.”

### Understanding Comprehend Events system output

The system returns JSON output for each submitted document. The structure of a response is shown below. Note:

* Events system output contains separate objects for `Entities` and `Events`, each organized into groups of coreferential objects.  
* Two additional fields, `File` and `Line`, are also present to track document provenance.

In [8]:
# Use the first result document for analysis
result = results[0]
raw_text = raw_texts[0]

In [9]:
raw_text

"Amazon (NASDAQ:AMZN) and Whole Foods Market, Inc. (NASDAQ:WFM) today announced that they have entered into a definitive merger agreement under which Amazon will acquire Whole Foods Market for $42 per share in an all-cash transaction valued at approximately $13.7 billion, including Whole Foods Market’s net debt.  “Millions of people love Whole Foods Market because they offer the best natural and organic foods, and they make it fun to eat healthy,” said Jeff Bezos, Amazon founder and CEO. “Whole Foods Market has been satisfying, delighting and nourishing customers for nearly four decades – they’re doing an amazing job and we want that to continue.”  “This partnership presents an opportunity to maximize value for Whole Foods Market’s shareholders, while at the same time extending our mission and bringing the highest quality, experience, convenience and innovation to our customers,” said John Mackey, Whole Foods Market co-founder and CEO.  Whole Foods Market will continue to operate store

In [10]:
result

{'Entities': [{'Mentions': [{'BeginOffset': 0,
     'EndOffset': 6,
     'GroupScore': 1.0,
     'Score': 0.999501,
     'Text': 'Amazon',
     'Type': 'ORGANIZATION'},
    {'BeginOffset': 149,
     'EndOffset': 155,
     'GroupScore': 0.9936,
     'Score': 0.999615,
     'Text': 'Amazon',
     'Type': 'ORGANIZATION'},
    {'BeginOffset': 468,
     'EndOffset': 474,
     'GroupScore': 0.584694,
     'Score': 0.998912,
     'Text': 'Amazon',
     'Type': 'ORGANIZATION'}]},
  {'Mentions': [{'BeginOffset': 8,
     'EndOffset': 19,
     'GroupScore': 1.0,
     'Score': 0.990119,
     'Text': 'NASDAQ:AMZN',
     'Type': 'STOCK_CODE'}]},
  {'Mentions': [{'BeginOffset': 25,
     'EndOffset': 49,
     'GroupScore': 1.0,
     'Score': 0.999654,
     'Text': 'Whole Foods Market, Inc.',
     'Type': 'ORGANIZATION'},
    {'BeginOffset': 169,
     'EndOffset': 187,
     'GroupScore': 0.990907,
     'Score': 0.999668,
     'Text': 'Whole Foods Market',
     'Type': 'ORGANIZATION'},
    {'BeginOffset

#### Events are groups of Triggers

* The API output includes the text, character offset, and type of each trigger.  

* Confidence scores for classification tasks are given as `Score`. Confidence of event group membership is given with `GroupScore`.  

In [11]:
result['Events'][1]['Triggers']

[{'BeginOffset': 161,
  'EndOffset': 168,
  'GroupScore': 1.0,
  'Score': 0.999958,
  'Text': 'acquire',
  'Type': 'CORPORATE_ACQUISITION'},
 {'BeginOffset': 221,
  'EndOffset': 232,
  'GroupScore': 0.999985,
  'Score': 0.931137,
  'Text': 'transaction',
  'Type': 'CORPORATE_ACQUISITION'}]

#### Arguments are linked to Entities by EntityIndex

* The API also returns the classification confidence of the role assignment.

In [12]:
result['Events'][1]['Arguments']

[{'EntityIndex': 5, 'Role': 'AMOUNT', 'Score': 0.99873},
 {'EntityIndex': 4, 'Role': 'DATE', 'Score': 0.994578},
 {'EntityIndex': 2, 'Role': 'INVESTEE', 'Score': 0.999668},
 {'EntityIndex': 0, 'Role': 'INVESTOR', 'Score': 0.999615}]
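
To make the link concrete, here is a minimal sketch that resolves each argument's `EntityIndex` to the text of the entity group it points at. The `sample` dictionary is a trimmed, hand-copied stand-in for `result` (only three entity groups and one event, so the event sits at index 0 here rather than 1), and `resolve_arguments` is a hypothetical helper, not part of the service SDK.

```python
# Trimmed stand-in for `result`; only the fields used here are kept.
sample = {
    "Entities": [
        {"Mentions": [{"Text": "Amazon", "Type": "ORGANIZATION"}]},
        {"Mentions": [{"Text": "NASDAQ:AMZN", "Type": "STOCK_CODE"}]},
        {"Mentions": [{"Text": "Whole Foods Market, Inc.", "Type": "ORGANIZATION"}]},
    ],
    "Events": [
        {
            "Triggers": [{"Text": "acquire", "Type": "CORPORATE_ACQUISITION"}],
            "Arguments": [
                {"EntityIndex": 2, "Role": "INVESTEE", "Score": 0.999668},
                {"EntityIndex": 0, "Role": "INVESTOR", "Score": 0.999615},
            ],
        }
    ],
}


def resolve_arguments(doc, event_index):
    """Map each argument role to the first mention text of its entity group."""
    event = doc["Events"][event_index]
    # EntityIndex is a position in the top-level Entities list.
    return {
        arg["Role"]: doc["Entities"][arg["EntityIndex"]]["Mentions"][0]["Text"]
        for arg in event["Arguments"]
    }


roles = resolve_arguments(sample, 0)
print(roles)
```

A dictionary keyed by role keeps only one entity per role; if an event can carry several arguments with the same role, collecting the texts in a list per role would be the safer choice.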

#### Entities are groups of Mentions

* The API output includes the text, character offset, and type of each mention.  

* Confidence scores for classification tasks are given as `Score`. Confidence of entity group membership is given with `GroupScore`.  

In [13]:
result['Entities'][0]['Mentions']

[{'BeginOffset': 0,
  'EndOffset': 6,
  'GroupScore': 1.0,
  'Score': 0.999501,
  'Text': 'Amazon',
  'Type': 'ORGANIZATION'},
 {'BeginOffset': 149,
  'EndOffset': 155,
  'GroupScore': 0.9936,
  'Score': 0.999615,
  'Text': 'Amazon',
  'Type': 'ORGANIZATION'},
 {'BeginOffset': 468,
  'EndOffset': 474,
  'GroupScore': 0.584694,
  'Score': 0.998912,
  'Text': 'Amazon',
  'Type': 'ORGANIZATION'}]
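
Since `GroupScore` expresses how confident the model is that a mention belongs to its group, it can be used to set aside weaker members. The sketch below applies an arbitrary threshold of 0.9 (an assumption chosen for illustration) to the three mentions shown above; `split_by_group_score` is a hypothetical helper.

```python
# The three "Amazon" mentions from the output above, trimmed to the relevant keys.
mentions = [
    {"BeginOffset": 0, "EndOffset": 6, "GroupScore": 1.0, "Score": 0.999501, "Text": "Amazon"},
    {"BeginOffset": 149, "EndOffset": 155, "GroupScore": 0.9936, "Score": 0.999615, "Text": "Amazon"},
    {"BeginOffset": 468, "EndOffset": 474, "GroupScore": 0.584694, "Score": 0.998912, "Text": "Amazon"},
]


def split_by_group_score(mentions, min_group_score):
    """Separate mentions into confident and doubtful group members."""
    kept = [m for m in mentions if m["GroupScore"] >= min_group_score]
    dropped = [m for m in mentions if m["GroupScore"] < min_group_score]
    return kept, dropped


kept, dropped = split_by_group_score(mentions, 0.9)
print([m["BeginOffset"] for m in kept])     # offsets of confident members
print([m["BeginOffset"] for m in dropped])  # offsets of doubtful members
```

With this threshold the mention at offset 468 is set aside: its type classification is confident (`Score` 0.998912), but its membership in this group is not (`GroupScore` 0.584694).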

### Visualizing the Events and Entities

In the remainder of the notebook, we'll give a number of tabulations and visualizations to help understand what the API is returning.

First we'll consider visualization of spans, both triggers and entity mentions. For sequence labeling, one of the most essential visualizations is highlighting the tagged text in the document. For demo purposes, we'll do this with [displaCy](https://spacy.io/usage/visualizers).
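
Both triggers and mentions are located by `BeginOffset` and `EndOffset`. Judging from the output shown earlier, these behave like Python slice bounds (start inclusive, end exclusive), which is also the convention displaCy's manual mode uses for `start` and `end`. A quick sanity check, using only the opening of the example document and a handful of offsets copied from the output, might look like this:

```python
# Opening of the example document, copied from `raw_text`.
text = (
    "Amazon (NASDAQ:AMZN) and Whole Foods Market, Inc. (NASDAQ:WFM) today "
    "announced that they have entered into a definitive merger agreement "
    "under which Amazon will acquire Whole Foods Market"
)

# (BeginOffset, EndOffset, Text) triples copied from the service output.
spans = [
    (0, 6, "Amazon"),
    (8, 19, "NASDAQ:AMZN"),
    (25, 49, "Whole Foods Market, Inc."),
    (161, 168, "acquire"),
    (169, 187, "Whole Foods Market"),
]

# Slicing the raw text with the offsets should reproduce the reported Text.
recovered = [text[begin:end] for begin, end, _ in spans]
for (begin, end, expected), actual in zip(spans, recovered):
    print(begin, end, repr(actual), actual == expected)
```

If a slice does not match its `Text`, the local copy of the document probably differs from what was uploaded, for example in whitespace or encoding.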

In [14]:
# Convert Events output to displaCy format.
entities = [
    {'start': m['BeginOffset'], 'end': m['EndOffset'], 'label': m['Type']}
    for e in result['Entities']
    for m in e['Mentions']
]

triggers = [
    {'start': t['BeginOffset'], 'end': t['EndOffset'], 'label': t['Type']}
    for e in result['Events']
    for t in e['Triggers']
]

# Spans need to be sorted for displaCy to process them correctly
spans = sorted(entities + triggers, key=lambda x: x['start'])
tags = [s['label'] for s in spans]

output = [{"text": raw_text, "ents": spans, "title": None, "settings": {}}]

In [15]:
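# Misc. objects for presentation purposes: one color per distinct tag.
# plt.get_cmap resamples the colormap to the requested number of colors.
unique_tags = sorted(set(tags))
spectral = plt.get_cmap("Spectral", len(unique_tags))
tag_colors = [colors.rgb2hex(spectral(i)) for i in range(len(unique_tags))]
color_map = dict(zip(unique_tags, tag_colors))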

In [16]:
# Note that only Entities participating in Events are shown.
displacy.render(output, style="ent", options={"colors": color_map}, manual=True)

### Rendering as tabular data

Many users will use Events to create structured data from unstructured text. Here we'll demonstrate how to do this with `pandas`. First, we flatten the hierarchical JSON into pandas DataFrames.

In [17]:
# Creation of the entity dataframe. Entity indices must be explicitly created.
entities_df = pd.DataFrame([
    {"EntityIndex": i, **m}
    for i, e in enumerate(result['Entities'])
    for m in e['Mentions']
])

# Creation of the events dataframe. Event indices must be explicitly created.
events_df = pd.DataFrame([
    {"EventIndex": i, **a, **t}
    for i, e in enumerate(result['Events'])
    for a in e['Arguments']
    for t in e['Triggers']
])

# Join the two tables into one flat data structure.
events_df = events_df.merge(entities_df, on="EntityIndex", suffixes=('Event', 'Entity'))
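
One detail of the flattening step is worth noting: arguments and triggers both carry a `Score` key, so when the two dictionaries are unpacked into one row, the trigger's `Score` overwrites the argument's role score. If you want to keep both, a possible variation is to rename the keys before combining them. The sketch below uses plain Python on a trimmed stand-in for one event; `rename_score`, `RoleScore`, and `TriggerScore` are names invented for this illustration.

```python
# Trimmed stand-in for one event from `result`; values copied from the output above.
event = {
    "Triggers": [
        {"Text": "acquire", "Type": "CORPORATE_ACQUISITION", "Score": 0.999958},
    ],
    "Arguments": [
        {"EntityIndex": 2, "Role": "INVESTEE", "Score": 0.999668},
        {"EntityIndex": 0, "Role": "INVESTOR", "Score": 0.999615},
    ],
}


def rename_score(record, new_name):
    """Return a copy of `record` with its `Score` key renamed."""
    renamed = {k: v for k, v in record.items() if k != "Score"}
    renamed[new_name] = record["Score"]
    return renamed


# One row per (argument, trigger) pair, with both scores preserved.
rows = [
    {"EventIndex": 1, **rename_score(a, "RoleScore"), **rename_score(t, "TriggerScore")}
    for a in event["Arguments"]
    for t in event["Triggers"]
]
print(rows[0])
```

The `rows` list can be passed to `pd.DataFrame` in the same way as the list comprehension above.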

In [18]:
events_df

| | EventIndex | EntityIndex | Role | ScoreEvent | BeginOffsetEvent | EndOffsetEvent | GroupScoreEvent | TextEvent | TypeEvent | BeginOffsetEntity | EndOffsetEntity | GroupScoreEntity | ScoreEntity | TextEntity | TypeEntity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | DATE | 0.999611 | 120 | 126 | 1.000000 | merger | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 1 | 0 | 4 | DATE | 0.999829 | 662 | 673 | 0.999969 | partnership | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 2 | 0 | 4 | DATE | 0.992193 | 1237 | 1248 | 0.509698 | transaction | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 3 | 0 | 4 | DATE | 0.998367 | 1403 | 1414 | 0.336709 | transaction | CORPORATE_MERGER | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| 4 | 1 | 4 | DATE | 0.999958 | 161 | 168 | 1.000000 | acquire | CORPORATE_ACQUISITION | 63 | 68 | 1.000000 | 0.994578 | today | DATE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 132 | 1 | 0 | INVESTOR | 0.931137 | 221 | 232 | 0.999985 | transaction | CORPORATE_ACQUISITION | 468 | 474 | 0.584694 | 0.998912 | Amazon | ORGANIZATION |
| 133 | 2 | 6 | EMPLOYEE | 0.999938 | 1116 | 1122 | 1.000000 | remain | EMPLOYMENT | 897 | 908 | 1.000000 | 0.999606 | John Mackey | PERSON |
| 134 | 2 | 6 | EMPLOYEE | 0.999938 | 1116 | 1122 | 1.000000 | remain | EMPLOYMENT | 1099 | 1110 | 0.977111 | 0.999699 | John Mackey | PERSON |
| 135 | 2 | 7 | EMPLOYEE_TITLE | 0.999938 | 1116 | 1122 | 1.000000 | remain | EMPLOYMENT | 944 | 947 | 1.000000 | 0.997071 | CEO | PERSON_TITLE |


### A more succinct representation

We're primarily interested in the *event structure*, so let's make that more transparent by creating a new table with Roles as column headers, grouped by Event.

In [19]:
def format_compact_events(x):
    """Collapse groups of mentions and triggers into a single set."""
    # Take the most commonly occurring EventType and the set of triggers.
    d = {"EventType": Counter(x['TypeEvent']).most_common()[0][0],
         "Triggers": set(x['TextEvent'])}
    # For each argument Role, collect the set of mentions in the group.
    for role in x['Role']:
        d.update({role: set((x[x['Role']==role]['TextEntity']))})
    return d

# Group data by EventIndex and format.
event_analysis_df = pd.DataFrame(
    events_df.groupby("EventIndex").apply(format_compact_events).tolist()
).fillna('')

In [20]:
event_analysis_df

| | EventType | Triggers | DATE | PARTICIPANT | INVESTEE | AMOUNT | INVESTOR | EMPLOYER | EMPLOYEE | EMPLOYEE_TITLE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CORPORATE_MERGER | {transaction, partnership, merger} | {during the second half of 2017, today} | {they, NASDAQ:WFM, we, Whole Foods Market, Who... | | | | | | |
| 1 | CORPORATE_ACQUISITION | {acquire, transaction} | {today} | | {we, they, Whole Foods Market, Whole Foods Mar... | {\$13.7 billion, \$42} | {Amazon} | | | |
| 2 | EMPLOYMENT | {remain} | | | | | | {we, they, Whole Foods Market, Whole Foods Mar... | {John Mackey} | {CEO} |


### Graphing event semantics

The most striking representation of Comprehend Events output is found in a semantic graph, a network of the entities and events referenced in a document or documents. The code below uses two open source libraries, `networkx` and `pyvis`, to render events system output. In the resulting graph, nodes are entity mentions and triggers, while edges are the argument roles held by the entities in relation to the triggers.

#### Formatting the data

System output must first be converted to the node (i.e., vertex) and edge list format that `networkx` expects. This requires iterating over triggers, entities, and the argument relations that link them. Note that we can use the `GroupScore` and `Score` keys on various objects to prune nodes and edges in which the model has less confidence. We can also use various strategies to pick a 'canonical' mention from each mention group to appear in the graph; here we choose the mention with the longest text.
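
To illustrate how the choice of strategy can change the graph label, the sketch below compares two possible rules on the Whole Foods Market mentions from the output shown earlier: the longest text, and the highest classification `Score`. Both helper functions are hypothetical and written only for this comparison.

```python
# Two mentions of the same entity group, copied from the output shown earlier.
mentions = [
    {"Text": "Whole Foods Market, Inc.", "GroupScore": 1.0, "Score": 0.999654},
    {"Text": "Whole Foods Market", "GroupScore": 0.990907, "Score": 0.999668},
]


def longest_mention(mentions):
    """Pick the mention with the longest text (ties resolve to the first)."""
    return max(mentions, key=lambda m: len(m["Text"]))


def most_confident_mention(mentions):
    """Pick the mention with the highest classification Score."""
    return max(mentions, key=lambda m: m["Score"])


print(longest_mention(mentions)["Text"])
print(most_confident_mention(mentions)["Text"])
```

Here the two rules disagree: the longest mention is the full legal name, while the highest-scoring mention is the shorter form. Which one reads better as a node label is a presentation decision.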

In [21]:
# Entities are associated with events by group, not individual mention; for simplicity, 
# assume the canonical mention is the longest one.
def get_canonical_mention(mentions):
    extents = enumerate([m['Text'] for m in mentions])
    longest_name = sorted(extents, key=lambda x: len(x[1]))
    return [mentions[longest_name[-1][0]]]

# Set a global confidence threshold
thr = 0.5

# Nodes are (id, type, tag, score, mention_type) tuples.
trigger_nodes = [
    ("tr%d" % i, t['Type'], t['Text'], t['Score'], "trigger")
    for i, e in enumerate(result['Events'])
    for t in e['Triggers'][:1]
    if t['GroupScore'] > thr
]
entity_nodes = [
    ("en%d" % i, m['Type'], m['Text'], m['Score'], "entity")
    for i, e in enumerate(result['Entities'])
    for m in get_canonical_mention(e['Mentions'])
    if m['GroupScore'] > thr
]

# Edges are (trigger_id, node_id, role, score) tuples.
argument_edges = [
    ("tr%d" % i, "en%d" % a['EntityIndex'], a['Role'], a['Score'])
    for i, e in enumerate(result['Events'])
    for a in e['Arguments']
    if a['Score'] > thr
]    

#### Create a compact graph

Once the nodes and edges are defined, we can create and visualize the graph.

In [22]:
G = nx.Graph()

# Iterate over triggers and entity mentions.
for mention_id, tag, extent, score, mtype in trigger_nodes + entity_nodes:
    label = extent if mtype.startswith("entity") else tag
    G.add_node(mention_id, label=label, size=score*10, color=color_map[tag], tag=tag, group=mtype)
    
# Iterate over argument role assignments
for event_id, entity_id, role, score in argument_edges:
    G.add_edges_from(
        [(event_id, entity_id)],
        label=role,
        weight=score*100,
        color="grey"
    )

# Drop mentions that don't participate in events
G.remove_nodes_from(list(nx.isolates(G)))

In [23]:
nt = Network("600px", "800px", notebook=True, heading="")
nt.from_nx(G)
nt.show("compact_nx.html")

#### A more complete graph

The graph above is compact, only relaying essential event type and argument role information. We can use a slightly more complicated set of functions to graph all of the information returned by the API.

In [28]:
# This convenience function in `events_graph.py` plots a complete graph of the document,
# showing all events, triggers, entities, and their groups.

import events_graph as evg

evg.plot(result, node_types=['event', 'trigger', 'entity_group', 'entity'], thr=0.5)