{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Amazon Comprehend Events Finance Tutorial\n", "\n", "This notebook is the intended companion to the Amazon Machine Learning blog post entitled, \"[Announcing the launch of Amazon Comprehend Events](http://TBD).\" It includes step-by-step instructions for submitting documents to the Comprehend Events Asynchronous API, understanding the system predictions made by the service, and performing a number of transformations and visualizations of the data for analytic purposes." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install -r requirements.txt > /dev/null" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import json\n", "import requests\n", "import uuid\n", "\n", "import networkx as nx\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import boto3\n", "import smart_open\n", "\n", "from time import sleep\n", "from matplotlib import cm, colors\n", "from spacy import displacy\n", "from collections import Counter\n", "from pyvis.network import Network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write documents to S3\n", "\n", "We've included a set of Amazon press releases as example documents. Here we upload them as a single file `sample_finance_dataset.txt` to an S3 bucket for processing. The same bucket will be used to return service output." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Client and session information\n", "session = boto3.Session()\n", "s3_client = session.client(service_name=\"s3\")\n", "\n", "# Constants for S3 bucket and input data file\n", "bucket = \"comprehend-events-blogpost-us-east-1\"\n", "filename = \"sample_finance_dataset.txt\"\n", "input_data_s3_path = f's3://{bucket}/' + filename\n", "output_data_s3_path = f's3://{bucket}/'\n", "\n", "# Upload the local file to S3\n", "s3_client.upload_file(\"../data/\" + filename, bucket, filename)\n", "\n", "# Load the documents locally for later analysis\n", "with open(\"../data/\" + filename, \"r\") as fi:\n", " raw_texts = [line.strip() for line in fi.readlines()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start an asynchronous job with the SDK\n", "\n", "The first task is to kick off the inference job. We'll do this with the `start_events_detection_job` endpoint. Note that the API requires an IAM role with List, Read, and Write access to the bucket specified above." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Comprehend client information\n", "comprehend_client = session.client(service_name=\"comprehend\")\n", "\n", "# IAM role with access to Comprehend and specified S3 buckets\n", "job_data_access_role = 'arn:aws:iam::xxxxxxxxxxxxx:role/service-role/AmazonComprehendServiceRole-test-events-role'\n", "\n", "# Other job parameters\n", "input_data_format = 'ONE_DOC_PER_LINE'\n", "job_uuid = uuid.uuid1()\n", "job_name = f\"events-job-{job_uuid}\"\n", "event_types = [\"BANKRUPTCY\", \"EMPLOYMENT\", \"CORPORATE_ACQUISITION\", \n", " \"INVESTMENT_GENERAL\", \"CORPORATE_MERGER\", \"IPO\",\n", " \"RIGHTS_ISSUE\", \"SECONDARY_OFFERING\", \"SHELF_OFFERING\",\n", " \"TENDER_OFFERING\", \"STOCK_SPLIT\"]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Begin the inference job\n", "response = comprehend_client.start_events_detection_job(\n", " InputDataConfig={'S3Uri': input_data_s3_path,\n", " 'InputFormat': input_data_format},\n", " OutputDataConfig={'S3Uri': output_data_s3_path},\n", " DataAccessRoleArn=job_data_access_role,\n", " JobName=job_name,\n", " LanguageCode='en',\n", " TargetEventTypes=event_types\n", ")\n", "\n", "# Get the job ID\n", "events_job_id = response['JobId']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collect the results from S3\n", "\n", "We poll the service with the `describe_events_detection_job` endpoint. Note that, as an asynchronous inference job, the task will take several minutes to complete. 
" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Get current job status\n", "job = comprehend_client.describe_events_detection_job(JobId=events_job_id)\n", "\n", "# Loop until job is completed\n", "waited = 0\n", "timeout_minutes = 30\n", "while job['EventsDetectionJobProperties']['JobStatus'] != 'COMPLETED':\n", " sleep(60)\n", " waited += 60\n", " assert waited//60 < timeout_minutes, \"Job timed out after %d seconds.\" % waited\n", " job = comprehend_client.describe_events_detection_job(JobId=events_job_id)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# The output filename is the input filename + \".out\"\n", "output_data_s3_file = job['EventsDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'\n", "\n", "# Load the output into a result dictionary # Get the files.\n", "results = []\n", "with smart_open.open(output_data_s3_file) as fi:\n", " results.extend([json.loads(line) for line in fi.readlines() if line])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Analyzing Comprehend Events output\n", "\n", "The remainder of this notebook provides examples of different ways to analyze a given document. For our example document, we'll use the kind of online posting that a Financial analyst might consume when projecting market trends, a [2017 press release about Amazon's acquisition of Whole Foods Market, Inc.](https://press.aboutamazon.com/news-releases/news-release-details/amazoncom-announces-third-quarter-sales-34-437-billion). It's the first document in the data set we submitted to the Comprehend Events API." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "> Amazon.com, Inc. 
(NASDAQ: AMZN) today announced financial results for its third quarter ended September 30, 2017.\n", "\n", "> Operating cash flow increased 14% to \\\\$17.1 billion for the trailing twelve months, compared with \\\\$15.0 billion for the trailing twelve months ended September 30, 2016. Free cash flow decreased to \\\\$8.1 billion for the trailing twelve months, compared with \\\\$9.0 billion for the trailing twelve months ended September 30, 2016. Free cash flow less lease principal repayments decreased to \\\\$3.5 billion for the trailing twelve months, compared with \\\\$5.3 billion for the trailing twelve months ended September 30, 2016. Free cash flow less finance lease principal repayments and assets acquired under capital leases decreased to an outflow of \\\\$1.0 billion for the trailing twelve months, compared with an inflow of \\\\$3.8 billion for the trailing twelve months ended September 30, 2016.\n", "\n", "> Common shares outstanding plus shares underlying stock-based awards totaled 503 million on September 30, 2017, compared with 496 million one year ago.\n", "\n", "> Net sales increased 34% to \\\\$43.7 billion in the third quarter, compared with \\\\$32.7 billion in third quarter 2016. Net sales includes \\\\$1.3 billion from Whole Foods Market, which Amazon acquired on August 28, 2017. Excluding Whole Foods Market and the \\\\$124 million favorable impact from year-over-year changes in foreign exchange rates throughout the quarter, net sales increased 29% compared with third quarter 2016.\n", "\n", "> Operating income decreased 40% to \\\\$347 million in the third quarter, compared with operating income of \\\\$575 million in third quarter 2016. 
Operating income includes income of \\\\$21 million from Whole Foods Market.\n", "\n", "> Net income was \\\\$256 million in the third quarter, or \\\\$0.52 per diluted share, compared with net income of \\\\$252 million, or \\\\$0.52 per diluted share, in third quarter 2016.\n", "\n", "> “In the last month alone, we’ve launched five new Alexa-enabled devices, introduced Alexa in India, announced integration with BMW, surpassed 25,000 skills, integrated Alexa with Sonos speakers, taught Alexa to distinguish between two voices, and more. Because Alexa’s brain is in the AWS cloud, her new abilities are available to all Echo customers, not just those who buy a new device,” said Jeff Bezos, Amazon founder and CEO. “And it’s working — customers have purchased tens of millions of Alexa-enabled devices, given Echo devices over 100,000 5-star reviews, and active customers are up more than 5x since the same time last year. With thousands of developers and hardware makers building new Alexa skills and devices, the Alexa experience will continue to get even better.”" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding Comprehend Events system output\n", "\n", "The system returns JSON output for each submitted document. The structure of a response is shown below. Note:\n", "\n", "* Events system output contains separate objects for `Entities` and `Events`, each organized into groups of coreferential objects.\n", "* Two additional fields, `File` and `Line`, are also present to track document provenance." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Use the first result document for analysis\n", "result = results[0]\n", "raw_text = raw_texts[0]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"Amazon (NASDAQ:AMZN) and Whole Foods Market, Inc. 
(NASDAQ:WFM) today announced that they have entered into a definitive merger agreement under which Amazon will acquire Whole Foods Market for $42 per share in an all-cash transaction valued at approximately $13.7 billion, including Whole Foods Market’s net debt. “Millions of people love Whole Foods Market because they offer the best natural and organic foods, and they make it fun to eat healthy,” said Jeff Bezos, Amazon founder and CEO. “Whole Foods Market has been satisfying, delighting and nourishing customers for nearly four decades – they’re doing an amazing job and we want that to continue.” “This partnership presents an opportunity to maximize value for Whole Foods Market’s shareholders, while at the same time extending our mission and bringing the highest quality, experience, convenience and innovation to our customers,” said John Mackey, Whole Foods Market co-founder and CEO. Whole Foods Market will continue to operate stores under the Whole Foods Market brand and source from trusted vendors and partners around the world. John Mackey will remain as CEO of Whole Foods Market and Whole Foods Market’s headquarters will stay in Austin, Texas. Completion of the transaction is subject to approval by Whole Foods Market's shareholders, regulatory approvals and other customary closing conditions. 
The parties expect to close the transaction during the second half of 2017.\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_text" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Entities': [{'Mentions': [{'BeginOffset': 0,\n", " 'EndOffset': 6,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999501,\n", " 'Text': 'Amazon',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 149,\n", " 'EndOffset': 155,\n", " 'GroupScore': 0.9936,\n", " 'Score': 0.999615,\n", " 'Text': 'Amazon',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 468,\n", " 'EndOffset': 474,\n", " 'GroupScore': 0.584694,\n", " 'Score': 0.998912,\n", " 'Text': 'Amazon',\n", " 'Type': 'ORGANIZATION'}]},\n", " {'Mentions': [{'BeginOffset': 8,\n", " 'EndOffset': 19,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.990119,\n", " 'Text': 'NASDAQ:AMZN',\n", " 'Type': 'STOCK_CODE'}]},\n", " {'Mentions': [{'BeginOffset': 25,\n", " 'EndOffset': 49,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999654,\n", " 'Text': 'Whole Foods Market, Inc.',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 169,\n", " 'EndOffset': 187,\n", " 'GroupScore': 0.990907,\n", " 'Score': 0.999668,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 282,\n", " 'EndOffset': 300,\n", " 'GroupScore': 0.61808,\n", " 'Score': 0.999653,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 339,\n", " 'EndOffset': 357,\n", " 'GroupScore': 0.379391,\n", " 'Score': 0.999708,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 366,\n", " 'EndOffset': 370,\n", " 'GroupScore': 0.277769,\n", " 'Score': 0.956068,\n", " 'Text': 'they',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 417,\n", " 'EndOffset': 421,\n", " 'GroupScore': 0.289622,\n", " 'Score': 0.689263,\n", " 'Text': 'they',\n", " 'Type': 'PERSON'},\n", " {'BeginOffset': 
493,\n", " 'EndOffset': 511,\n", " 'GroupScore': 0.19549,\n", " 'Score': 0.999641,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 628,\n", " 'EndOffset': 630,\n", " 'GroupScore': 0.131498,\n", " 'Score': 0.990074,\n", " 'Text': 'we',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 720,\n", " 'EndOffset': 738,\n", " 'GroupScore': 0.167593,\n", " 'Score': 0.999636,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 910,\n", " 'EndOffset': 928,\n", " 'GroupScore': 0.148614,\n", " 'Score': 0.999493,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 950,\n", " 'EndOffset': 968,\n", " 'GroupScore': 0.146546,\n", " 'Score': 0.999717,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 1011,\n", " 'EndOffset': 1029,\n", " 'GroupScore': 0.123549,\n", " 'Score': 0.999779,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 1133,\n", " 'EndOffset': 1151,\n", " 'GroupScore': 0.108892,\n", " 'Score': 0.999813,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 1156,\n", " 'EndOffset': 1174,\n", " 'GroupScore': 0.103557,\n", " 'Score': 0.999765,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 1275,\n", " 'EndOffset': 1293,\n", " 'GroupScore': 0.096173,\n", " 'Score': 0.999802,\n", " 'Text': 'Whole Foods Market',\n", " 'Type': 'ORGANIZATION'}]},\n", " {'Mentions': [{'BeginOffset': 51,\n", " 'EndOffset': 61,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.980082,\n", " 'Text': 'NASDAQ:WFM',\n", " 'Type': 'STOCK_CODE'}]},\n", " {'Mentions': [{'BeginOffset': 63,\n", " 'EndOffset': 68,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.994578,\n", " 'Text': 'today',\n", " 'Type': 'DATE'}]},\n", " {'Mentions': [{'BeginOffset': 192,\n", " 'EndOffset': 195,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.99873,\n", " 'Text': '$42',\n", 
" 'Type': 'MONETARY_VALUE'},\n", " {'BeginOffset': 257,\n", " 'EndOffset': 270,\n", " 'GroupScore': 0.547346,\n", " 'Score': 0.999486,\n", " 'Text': '$13.7 billion',\n", " 'Type': 'MONETARY_VALUE'}]},\n", " {'Mentions': [{'BeginOffset': 897,\n", " 'EndOffset': 908,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999606,\n", " 'Text': 'John Mackey',\n", " 'Type': 'PERSON'},\n", " {'BeginOffset': 1099,\n", " 'EndOffset': 1110,\n", " 'GroupScore': 0.977111,\n", " 'Score': 0.999699,\n", " 'Text': 'John Mackey',\n", " 'Type': 'PERSON'}]},\n", " {'Mentions': [{'BeginOffset': 944,\n", " 'EndOffset': 947,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.997071,\n", " 'Text': 'CEO',\n", " 'Type': 'PERSON_TITLE'},\n", " {'BeginOffset': 1126,\n", " 'EndOffset': 1129,\n", " 'GroupScore': 0.778198,\n", " 'Score': 0.998065,\n", " 'Text': 'CEO',\n", " 'Type': 'PERSON_TITLE'}]},\n", " {'Mentions': [{'BeginOffset': 1415,\n", " 'EndOffset': 1445,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999491,\n", " 'Text': 'during the second half of 2017',\n", " 'Type': 'DATE'}]}],\n", " 'Events': [{'Arguments': [{'EntityIndex': 4,\n", " 'Role': 'DATE',\n", " 'Score': 0.994578},\n", " {'EntityIndex': 1, 'Role': 'PARTICIPANT', 'Score': 0.990119},\n", " {'EntityIndex': 3, 'Role': 'PARTICIPANT', 'Score': 0.980082},\n", " {'EntityIndex': 2, 'Role': 'PARTICIPANT', 'Score': 0.999654},\n", " {'EntityIndex': 8, 'Role': 'DATE', 'Score': 0.999491}],\n", " 'Triggers': [{'BeginOffset': 120,\n", " 'EndOffset': 126,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999611,\n", " 'Text': 'merger',\n", " 'Type': 'CORPORATE_MERGER'},\n", " {'BeginOffset': 662,\n", " 'EndOffset': 673,\n", " 'GroupScore': 0.999969,\n", " 'Score': 0.999829,\n", " 'Text': 'partnership',\n", " 'Type': 'CORPORATE_MERGER'},\n", " {'BeginOffset': 1237,\n", " 'EndOffset': 1248,\n", " 'GroupScore': 0.509698,\n", " 'Score': 0.992193,\n", " 'Text': 'transaction',\n", " 'Type': 'CORPORATE_MERGER'},\n", " {'BeginOffset': 1403,\n", " 'EndOffset': 1414,\n", " 
'GroupScore': 0.336709,\n", " 'Score': 0.998367,\n", " 'Text': 'transaction',\n", " 'Type': 'CORPORATE_MERGER'}],\n", " 'Type': 'CORPORATE_MERGER'},\n", " {'Arguments': [{'EntityIndex': 5, 'Role': 'AMOUNT', 'Score': 0.99873},\n", " {'EntityIndex': 4, 'Role': 'DATE', 'Score': 0.994578},\n", " {'EntityIndex': 2, 'Role': 'INVESTEE', 'Score': 0.999668},\n", " {'EntityIndex': 0, 'Role': 'INVESTOR', 'Score': 0.999615}],\n", " 'Triggers': [{'BeginOffset': 161,\n", " 'EndOffset': 168,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999958,\n", " 'Text': 'acquire',\n", " 'Type': 'CORPORATE_ACQUISITION'},\n", " {'BeginOffset': 221,\n", " 'EndOffset': 232,\n", " 'GroupScore': 0.999985,\n", " 'Score': 0.931137,\n", " 'Text': 'transaction',\n", " 'Type': 'CORPORATE_ACQUISITION'}],\n", " 'Type': 'CORPORATE_ACQUISITION'},\n", " {'Arguments': [{'EntityIndex': 6, 'Role': 'EMPLOYEE', 'Score': 0.999699},\n", " {'EntityIndex': 7, 'Role': 'EMPLOYEE_TITLE', 'Score': 0.998065},\n", " {'EntityIndex': 2, 'Role': 'EMPLOYER', 'Score': 0.999813}],\n", " 'Triggers': [{'BeginOffset': 1116,\n", " 'EndOffset': 1122,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999938,\n", " 'Text': 'remain',\n", " 'Type': 'EMPLOYMENT'}],\n", " 'Type': 'EMPLOYMENT'}],\n", " 'File': 'sample_finance_dataset.txt',\n", " 'Line': 0}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Events are groups of Triggers\n", "\n", "* The API output includes the text, character offset, and type of each trigger. \n", "\n", "* Confidence scores for classification tasks are given as `Score`. Confidence of event group membership is given with `GroupScore`. 
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[{'BeginOffset': 161,\n", " 'EndOffset': 168,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999958,\n", " 'Text': 'acquire',\n", " 'Type': 'CORPORATE_ACQUISITION'},\n", " {'BeginOffset': 221,\n", " 'EndOffset': 232,\n", " 'GroupScore': 0.999985,\n", " 'Score': 0.931137,\n", " 'Text': 'transaction',\n", " 'Type': 'CORPORATE_ACQUISITION'}]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['Events'][1]['Triggers']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Arguments are linked to Entities by EntityIndex\n", "\n", "* The API also return the classification confidence of the role assignment." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[{'EntityIndex': 5, 'Role': 'AMOUNT', 'Score': 0.99873},\n", " {'EntityIndex': 4, 'Role': 'DATE', 'Score': 0.994578},\n", " {'EntityIndex': 2, 'Role': 'INVESTEE', 'Score': 0.999668},\n", " {'EntityIndex': 0, 'Role': 'INVESTOR', 'Score': 0.999615}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['Events'][1]['Arguments']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Entities are groups of Mentions\n", "\n", "* The API output includes the text, character offset, and type of each mention. \n", "\n", "* Confidence scores for classification tasks are given as `Score`. Confidence of entity group membership is given with `GroupScore`. 
" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[{'BeginOffset': 0,\n", " 'EndOffset': 6,\n", " 'GroupScore': 1.0,\n", " 'Score': 0.999501,\n", " 'Text': 'Amazon',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 149,\n", " 'EndOffset': 155,\n", " 'GroupScore': 0.9936,\n", " 'Score': 0.999615,\n", " 'Text': 'Amazon',\n", " 'Type': 'ORGANIZATION'},\n", " {'BeginOffset': 468,\n", " 'EndOffset': 474,\n", " 'GroupScore': 0.584694,\n", " 'Score': 0.998912,\n", " 'Text': 'Amazon',\n", " 'Type': 'ORGANIZATION'}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result['Entities'][0]['Mentions']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Visualizing the Events and Entities\n", "\n", "In the remainder of the notebook, we'll give a number of tabulations and visualizations to help understand what the API is returning.\n", "\n", "First we'll consider visualization of spans, both triggers and entity mentions. One of the most essential visualization tasks for sequence labeling tasks is highlighting of tagged text in documents. For demo purposes, we'll do this with [displaCy](https://spacy.io/usage/visualizers)." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Convert Events output to displaCy format.\n", "entities = [\n", "    {'start': m['BeginOffset'], 'end': m['EndOffset'], 'label': m['Type']}\n", "    for e in result['Entities']\n", "    for m in e['Mentions']\n", "]\n", "\n", "triggers = [\n", "    {'start': t['BeginOffset'], 'end': t['EndOffset'], 'label': t['Type']}\n", "    for e in result['Events']\n", "    for t in e['Triggers']\n", "]\n", "\n", "# Spans need to be sorted for displaCy to process them correctly\n", "spans = sorted(entities + triggers, key=lambda x: x['start'])\n", "tags = [s['label'] for s in spans]\n", "\n", "output = [{\"text\": raw_text, \"ents\": spans, \"title\": None, \"settings\": {}}]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Map each tag to a color for presentation purposes.\n", "# plt.get_cmap is used because cm.get_cmap was removed in recent matplotlib releases.\n", "spectral = plt.get_cmap(\"Spectral\", len(tags))\n", "tag_colors = [colors.rgb2hex(spectral(i)) for i in range(len(tags))]\n", "color_map = dict(zip(tags, tag_colors))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Amazon\n", " ORGANIZATION\n", "\n", " (\n", "\n", " NASDAQ:AMZN\n", " STOCK_CODE\n", "\n", ") and \n", "\n", " Whole Foods Market, Inc.\n", " ORGANIZATION\n", "\n", " (\n", "\n", " NASDAQ:WFM\n", " STOCK_CODE\n", "\n", ") \n", "\n", " today\n", " DATE\n", "\n", " announced that they have entered into a definitive \n", "\n", " merger\n", " CORPORATE_MERGER\n", "\n", " agreement under which \n", "\n", " Amazon\n", " ORGANIZATION\n", "\n", " will \n", "\n", " acquire\n", " CORPORATE_ACQUISITION\n", "\n", " \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " for \n", "\n", " $42\n", " MONETARY_VALUE\n", "\n", " per share in an all-cash \n", "\n", " transaction\n", " CORPORATE_ACQUISITION\n", "\n", " valued at approximately \n", "\n", " $13.7 billion\n", " MONETARY_VALUE\n", "\n", ", including \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", "’s net debt. “Millions of people love \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " because \n", "\n", " they\n", " ORGANIZATION\n", "\n", " offer the best natural and organic foods, and \n", "\n", " they\n", " PERSON\n", "\n", " make it fun to eat healthy,” said Jeff Bezos, \n", "\n", " Amazon\n", " ORGANIZATION\n", "\n", " founder and CEO. 
“\n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " has been satisfying, delighting and nourishing customers for nearly four decades – they’re doing an amazing job and \n", "\n", " we\n", " ORGANIZATION\n", "\n", " want that to continue.” “This \n", "\n", " partnership\n", " CORPORATE_MERGER\n", "\n", " presents an opportunity to maximize value for \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", "’s shareholders, while at the same time extending our mission and bringing the highest quality, experience, convenience and innovation to our customers,” said \n", "\n", " John Mackey\n", " PERSON\n", "\n", ", \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " co-founder and \n", "\n", " CEO\n", " PERSON_TITLE\n", "\n", ". \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " will continue to operate stores under the \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " brand and source from trusted vendors and partners around the world. \n", "\n", " John Mackey\n", " PERSON\n", "\n", " will \n", "\n", " remain\n", " EMPLOYMENT\n", "\n", " as \n", "\n", " CEO\n", " PERSON_TITLE\n", "\n", " of \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", " and \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", "’s headquarters will stay in Austin, Texas. Completion of the \n", "\n", " transaction\n", " CORPORATE_MERGER\n", "\n", " is subject to approval by \n", "\n", " Whole Foods Market\n", " ORGANIZATION\n", "\n", "'s shareholders, regulatory approvals and other customary closing conditions. The parties expect to close the \n", "\n", " transaction\n", " CORPORATE_MERGER\n", "\n", " \n", "\n", " during the second half of 2017\n", " DATE\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Note that only Entities participating in Events are shown.\n", "displacy.render(output, style=\"ent\", options={\"colors\": color_map}, manual=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Rendering as tabular data\n", "\n", "Many users will use Events to create structured data from unstructured text. Here we'll demonstrate how to do this with `pandas`. First, we flatten hierarchical JSON to pandas dataframe. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Creation of the entity dataframe. Entity indices must be explicitly created.\n", "entities_df = pd.DataFrame([\n", " {\"EntityIndex\": i, **m}\n", " for i, e in enumerate(result['Entities'])\n", " for m in e['Mentions']\n", "])\n", "\n", "# Creation of the events dataframe. Event indices must be explicitly created.\n", "events_df = pd.DataFrame([\n", " {\"EventIndex\": i, **a, **t}\n", " for i, e in enumerate(result['Events'])\n", " for a in e['Arguments']\n", " for t in e['Triggers']\n", "])\n", "\n", "# Join the two tables into one flat data structure.\n", "events_df = events_df.merge(entities_df, on=\"EntityIndex\", suffixes=('Event', 'Entity'))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " EventIndex EntityIndex Role ScoreEvent BeginOffsetEvent \\\n", "0 0 4 DATE 0.999611 120 \n", "1 0 4 DATE 0.999829 662 \n", "2 0 4 DATE 0.992193 1237 \n", "3 0 4 DATE 0.998367 1403 \n", "4 1 4 DATE 0.999958 161 \n", ".. ... ... ... ... ... \n", "132 1 0 INVESTOR 0.931137 221 \n", "133 2 6 EMPLOYEE 0.999938 1116 \n", "134 2 6 EMPLOYEE 0.999938 1116 \n", "135 2 7 EMPLOYEE_TITLE 0.999938 1116 \n", "136 2 7 EMPLOYEE_TITLE 0.999938 1116 \n", "\n", " EndOffsetEvent GroupScoreEvent TextEvent TypeEvent \\\n", "0 126 1.000000 merger CORPORATE_MERGER \n", "1 673 0.999969 partnership CORPORATE_MERGER \n", "2 1248 0.509698 transaction CORPORATE_MERGER \n", "3 1414 0.336709 transaction CORPORATE_MERGER \n", "4 168 1.000000 acquire CORPORATE_ACQUISITION \n", ".. ... ... ... ... \n", "132 232 0.999985 transaction CORPORATE_ACQUISITION \n", "133 1122 1.000000 remain EMPLOYMENT \n", "134 1122 1.000000 remain EMPLOYMENT \n", "135 1122 1.000000 remain EMPLOYMENT \n", "136 1122 1.000000 remain EMPLOYMENT \n", "\n", " BeginOffsetEntity EndOffsetEntity GroupScoreEntity ScoreEntity \\\n", "0 63 68 1.000000 0.994578 \n", "1 63 68 1.000000 0.994578 \n", "2 63 68 1.000000 0.994578 \n", "3 63 68 1.000000 0.994578 \n", "4 63 68 1.000000 0.994578 \n", ".. ... ... ... ... \n", "132 468 474 0.584694 0.998912 \n", "133 897 908 1.000000 0.999606 \n", "134 1099 1110 0.977111 0.999699 \n", "135 944 947 1.000000 0.997071 \n", "136 1126 1129 0.778198 0.998065 \n", "\n", " TextEntity TypeEntity \n", "0 today DATE \n", "1 today DATE \n", "2 today DATE \n", "3 today DATE \n", "4 today DATE \n", ".. ... ... 
\n", "132 Amazon ORGANIZATION \n", "133 John Mackey PERSON \n", "134 John Mackey PERSON \n", "135 CEO PERSON_TITLE \n", "136 CEO PERSON_TITLE \n", "\n", "[137 rows x 15 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "events_df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A more succinct representation\n", "\n", "We're primarily interested in the *event structure*, so let's make that more transparent by creating a new table with Roles as column headers, grouped by Event." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def format_compact_events(x):\n", "    \"\"\"Collapse groups of mentions and triggers into a single set.\"\"\"\n", "    # Take the most commonly occurring EventType and the set of triggers.\n", "    d = {\"EventType\": Counter(x['TypeEvent']).most_common()[0][0],\n", "         \"Triggers\": set(x['TextEvent'])}\n", "    # For each distinct argument Role, collect the set of mentions in the group.\n", "    for role in x['Role'].unique():\n", "        d.update({role: set((x[x['Role']==role]['TextEntity']))})\n", "    return d\n", "\n", "# Group data by EventIndex and format.\n", "event_analysis_df = pd.DataFrame(\n", "    events_df.groupby(\"EventIndex\").apply(format_compact_events).tolist()\n", ").fillna('')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "
| | EventType | Triggers | DATE | PARTICIPANT | INVESTEE | AMOUNT | INVESTOR | EMPLOYER | EMPLOYEE | EMPLOYEE_TITLE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CORPORATE_MERGER | {transaction, partnership, merger} | {during the second half of 2017, today} | {they, NASDAQ:WFM, we, Whole Foods Market, Who... | | | | | | |
| 1 | CORPORATE_ACQUISITION | {acquire, transaction} | {today} | | {we, they, Whole Foods Market, Whole Foods Mar... | {$13.7 billion, $42} | {Amazon} | | | |
| 2 | EMPLOYMENT | {remain} | | | | | | {we, they, Whole Foods Market, Whole Foods Mar... | {John Mackey} | {CEO} |
\n", "
" ], "text/plain": [ " EventType Triggers \\\n", "0 CORPORATE_MERGER {transaction, partnership, merger} \n", "1 CORPORATE_ACQUISITION {acquire, transaction} \n", "2 EMPLOYMENT {remain} \n", "\n", " DATE \\\n", "0 {during the second half of 2017, today} \n", "1 {today} \n", "2 \n", "\n", " PARTICIPANT \\\n", "0 {they, NASDAQ:WFM, we, Whole Foods Market, Who... \n", "1 \n", "2 \n", "\n", " INVESTEE AMOUNT \\\n", "0 \n", "1 {we, they, Whole Foods Market, Whole Foods Mar... {$13.7 billion, $42} \n", "2 \n", "\n", " INVESTOR EMPLOYER EMPLOYEE \\\n", "0 \n", "1 {Amazon} \n", "2 {we, they, Whole Foods Market, Whole Foods Mar... {John Mackey} \n", "\n", " EMPLOYEE_TITLE \n", "0 \n", "1 \n", "2 {CEO} " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "event_analysis_df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Graphing event semantics\n", "\n", "The most striking representation of Comprehend Events output is found in a semantic graph, a network of the entities and events referenced in a document or documents. The code below uses two open source libraries, `networkx` and `pyvis`, to render events system output. In the resulting graph, nodes are entity mentions and triggers, while edges are the argument roles held by the entities in relation to the triggers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Formatting the data\n", "\n", "System output must first be conformed to the node (i.e., vertex) and edge list format required by `networkx`. This requires iterating over triggers, entities, and argument structural relations. Note that we can use the `GroupScore` and `Score` keys on various objects to prune nodes and edges in which the model has less confidence. 
We can also use various strategies to pick a 'canonical' mention from each mention group to appear in the graph; here we choose the mention with the string-wise longest extent." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# Entities are associated with events by group, not individual mention; for simplicity,\n", "# assume the canonical mention is the longest one.\n", "def get_canonical_mention(mentions):\n", "    extents = enumerate([m['Text'] for m in mentions])\n", "    longest_name = sorted(extents, key=lambda x: len(x[1]))\n", "    return [mentions[longest_name[-1][0]]]\n", "\n", "# Set a global confidence threshold\n", "thr = 0.5\n", "\n", "# Nodes are (id, tag, extent, score, mention_type) tuples.\n", "trigger_nodes = [\n", "    (\"tr%d\" % i, t['Type'], t['Text'], t['Score'], \"trigger\")\n", "    for i, e in enumerate(result['Events'])\n", "    for t in e['Triggers'][:1]\n", "    if t['GroupScore'] > thr\n", "]\n", "entity_nodes = [\n", "    (\"en%d\" % i, m['Type'], m['Text'], m['Score'], \"entity\")\n", "    for i, e in enumerate(result['Entities'])\n", "    for m in get_canonical_mention(e['Mentions'])\n", "    if m['GroupScore'] > thr\n", "]\n", "\n", "# Edges are (trigger_id, node_id, role, score) tuples.\n", "argument_edges = [\n", "    (\"tr%d\" % i, \"en%d\" % a['EntityIndex'], a['Role'], a['Score'])\n", "    for i, e in enumerate(result['Events'])\n", "    for a in e['Arguments']\n", "    if a['Score'] > thr\n", "]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Create a compact graph\n", "\n", "Once the nodes and edges are defined, we can create and visualize the graph."
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "G = nx.Graph()\n", "\n", "# Iterate over triggers and entity mentions.\n", "for mention_id, tag, extent, score, mtype in trigger_nodes + entity_nodes:\n", " label = extent if mtype.startswith(\"entity\") else tag\n", " G.add_node(mention_id, label=label, size=score*10, color=color_map[tag], tag=tag, group=mtype)\n", " \n", "# Iterate over argument role assignments\n", "for event_id, entity_id, role, score in argument_edges:\n", " G.add_edges_from(\n", " [(event_id, entity_id)],\n", " label=role,\n", " weight=score*100,\n", " color=\"grey\"\n", " )\n", "\n", "# Drop mentions that don't participate in events\n", "G.remove_nodes_from(list(nx.isolates(G)))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nt = Network(\"600px\", \"800px\", notebook=True, heading=\"\")\n", "nt.from_nx(G)\n", "nt.show(\"compact_nx.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A more complete graph\n", "\n", "The graph above is compact, only relaying essential event type and argument role information. We can use a slightly more complicated set of functions to graph all of the information returned by the API." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This convenience function in `events_graph.py` plots a complete graph of the document,\n", "# showing all events, triggers, entities, and their groups.\n", "\n", "import events_graph as evg\n", "\n", "evg.plot(result, node_types=['event', 'trigger', 'entity_group', 'entity'], thr=0.5)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }