{ "cells": [ { "cell_type": "markdown", "id": "8d636275", "metadata": {}, "source": [ "# Dashboarding SEC Text for Financial NLP\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "8d636275", "metadata": {}, "source": [ "\n", "The U.S. Securities and Exchange Commission (SEC) filings are widely used in finance. Companies file the SEC filings to notify the world about their business conditions and the future outlook of the companies. Because of the potential predictive values, the SEC filings are good sources of information for workers in finance, ranging from individual investors to executives of large financial corporations. These filings are publicly available to all investors. \n", "\n", "In this example notebook, we focus on the following three types of SEC filings: 10-Ks, 10-Qs, and 8-Ks. \n", "\n", "* [10-Ks](https://www.investopedia.com/terms/1/10-k.asp) - Annual reports of companies (and will be quite detailed).\n", "* [10-Qs](https://www.investopedia.com/terms/1/10q.asp) - Quarterly reports, except in the quarter in which a 10K is filed (and are less detailed than 10-Ks).\n", "* [8-Ks](https://www.investopedia.com/terms/1/8-k.asp) - Filed at every instance when there is a change in business conditions that is material and needs to be reported. This means that there can be multiple 8-Ks filed throughout the fiscal year. \n", "\n", "The functionality of SageMaker JumpStart Industry will be presented throughout the notebook, which provides an overall dashboard to visualize the three types of filings with various analyses. We can append several standard financial characteristics, such as *Analyst Recommendation Mean* and *Return on Equity*, but one interesting part of the dashboard is *attribute scoring*. Using word lists derived from natural language processing (NLP) techniques, we will score the actual texts of these filings for a number of characteristics, such as risk, uncertainty, and positivity, as word proportions, providing simple, accessible numbers to represent these traits. Using this dashboard, anybody can pull up information and related statistics about any companies they have interest in, and digest it in a simple, useful way." ] }, { "cell_type": "markdown", "id": "c6a6aedf", "metadata": {}, "source": [ "## General Steps\n", "\n", "This notebook goes through the following steps to demonstrate how to extract texts from specific sections in SEC filings, score the texts, and summarize them.\n", "\n", "1. Retrieve and parse 10-K, 10-Q, 8-K filings. Retrieving these filings from SEC's EDGAR service is complicated, and parsing these forms into plain text for further analysis can be time-consuming. We provide the [SageMaker JumpStart Industry Python SDK](https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/index.html) to create a curated dataset in a *single API call*.\n", "2. Create separate dataframes for each of the three types of forms, along with separate columns for each extracted section. \n", "3. 
Combine two or more sections of the 10-K forms into a single text column, and use the NLP scoring API to add columns of numerical scores for this text. The combined column is called `text2score`. \n", "4. Add a column with a summary of the `text2score` column. \n", "5. Prepare the final dataframe that can be used as input for a dashboard.\n", "\n", "One feature of this notebook is breaking long SEC filings into separate sections, each of which deals with a different aspect of a company's reporting. The goal of this example notebook is to make accessing and processing texts from SEC filings easy for investors and for training their algorithms.\n", "\n", "**Note**: You can also access this notebook through SageMaker JumpStart, which is executable on SageMaker Studio. For more information, see [Amazon SageMaker JumpStart Industry](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart-industry.html) in the **Amazon SageMaker Developer Guide**." ] }, { "cell_type": "markdown", "id": "c25012ad", "metadata": {}, "source": [ ">**Important**: \n", ">This example notebook is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice." ] }, { "cell_type": "markdown", "id": "906d9742", "metadata": {}, "source": [ "## Financial NLP\n", "\n", "Financial NLP is one of the most rapidly growing use cases of ML in industry. For more discussion, see the following survey paper: [Deep Learning for Financial Applications: A Survey](https://arxiv.org/abs/2002.05786). The starting point for a vast amount of financial NLP is extracting and processing the texts of SEC filings. SEC filings report different types of information related to various events involving companies. To find a complete list of SEC forms, see [Forms List](https://www.sec.gov/forms). \n", "\n", "SEC filings are widely used by financial services firms as a source of information about companies in order to make trading, lending, investment, and risk management decisions. They contain forward-looking information that helps with forecasts and are written with a view to the future. In addition, in recent times, the value of historical time-series data has degraded, since economies have been structurally transformed by trade wars, pandemics, and political upheavals. Therefore, text as a source of forward-looking information has been increasing in relevance. \n", "\n", "There has been exponential growth in downloads of SEC filings. See [How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI](https://www.nber.org/papers/w27950); this paper reports that the number of machine downloads of corporate 10-K and 10-Q filings increased from 360,861 in 2003 to 165,318,719 in 2016. \n", "\n", "There is a vast body of academic and practitioner research based on financial text, a significant portion of which uses SEC filings. A recent review article summarizing this work is [Textual Analysis in Finance (2020)](https://www.annualreviews.org/doi/abs/10.1146/annurev-financial-012820-032249). \n", "\n", "This notebook describes how a user can quickly retrieve a set of forms, break them into sections, score the text in each section using pre-defined word lists, and prepare a dashboard to filter the data.
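\n", "\n", "To make word-proportion scoring concrete, here is a minimal, self-contained sketch (a toy `word_list_score` function with a made-up word list, not the actual `smjsindustry` implementation used later in this notebook):\n", "\n", "```python\n", "import re\n", "\n", "\n", "def word_list_score(text, word_list):\n", "    \"\"\"Fraction of tokens in `text` that appear in `word_list`.\"\"\"\n", "    tokens = re.findall(r\"[a-z']+\", text.lower())\n", "    if not tokens:\n", "        return 0.0\n", "    return sum(token in word_list for token in tokens) / len(tokens)\n", "\n", "\n", "risk_words = {\"risk\", \"adverse\", \"litigation\", \"uncertainty\"}  # toy word list\n", "word_list_score(\"Litigation risk may be material.\", risk_words)  # 2 of 5 tokens -> 0.4\n", "```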
" ] }, { "cell_type": "markdown", "id": "bc97786f", "metadata": {}, "source": [ "## SageMaker Notebook Kernel Setup\n", "\n", "Recommended kernel is **conda_python3**.\n", "For the instance type, using a larger instance with sufficient memory can be helpful to download the following materials." ] }, { "cell_type": "markdown", "id": "8f4a184b", "metadata": {}, "source": [ "## Load SDK and Helper Scripts" ] }, { "cell_type": "markdown", "id": "1c99e7eb", "metadata": {}, "source": [ "First, we import required packages and load the S3 bucket from SageMaker session, as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "3a668fea", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import pandas as pd\n", "import sagemaker\n", "\n", "pd.get_option(\"display.max_columns\", None)" ] }, { "cell_type": "code", "execution_count": null, "id": "04b79db2", "metadata": {}, "outputs": [], "source": [ "# Prepare the SageMaker session's default S3 bucket and a folder to store processed data\n", "session = sagemaker.Session()\n", "region = session._region_name\n", "bucket = session.default_bucket()\n", "role = sagemaker.get_execution_role()\n", "secdashboard_processed_folder = \"jumpstart_industry_secdashboard_processed\"" ] }, { "cell_type": "markdown", "id": "ad27f25f", "metadata": {}, "source": [ "### Install the `smjsindustry` library" ] }, { "cell_type": "markdown", "id": "0709ec5c", "metadata": {}, "source": [ "The following code cell downloads the [`smjsindustry` SDK](https://pypi.org/project/smjsindustry/) and helper scripts from the S3 buckets prepared by SageMaker JumpStart Industry. You will learn how to use the `smjsindustry` SDK which contains various APIs to curate SEC datasets. The dataset in this example was synthetically generated using the `smjsindustry` package's SEC Forms Retrieval tool. For more information, see the [SageMaker JumpStart Industry Python SDK documentation](https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/notebooks/index.html). " ] }, { "cell_type": "code", "execution_count": null, "id": "a8fe7661", "metadata": {}, "outputs": [], "source": [ "# Download scripts from S3\n", "notebook_artifact_bucket = f\"jumpstart-cache-prod-{region}\"\n", "notebook_sdk_prefix = \"smfinance-notebook-dependency/smjsindustry\"\n", "notebook_script_prefix = \"smfinance-notebook-data/sec-dashboard\"\n", "\n", "# Download smjsindustry SDK\n", "sdk_bucket = f\"s3://{notebook_artifact_bucket}/{notebook_sdk_prefix}\"\n", "!aws s3 sync $sdk_bucket ./\n", "\n", "# Download helper scripts\n", "scripts_bucket = f\"s3://{notebook_artifact_bucket}/{notebook_script_prefix}\"\n", "!aws s3 sync $scripts_bucket ./sec-dashboard" ] }, { "cell_type": "markdown", "id": "f0f11a3b", "metadata": {}, "source": [ "We deliver APIs through the `smjsindustry` client library. The first step requires pip installing a Python package that interacts with a SageMaker processing container. The retrieval, parsing, transforming, and scoring of text is a complex process and uses many different algorithms and packages. To make this seamless and stable for the user, the functionality is packaged into a collection of APIs. For installation and maintenance of the workflow, this approach reduces your effort to a pip install followed by a single API call." 
] }, { "cell_type": "code", "execution_count": null, "id": "aef31368", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Install smjsindustry SDK\n", "!pip install --no-index smjsindustry-1.0.0-py3-none-any.whl" ] }, { "cell_type": "code", "execution_count": null, "id": "c76d861c", "metadata": {}, "outputs": [], "source": [ "%pylab inline" ] }, { "cell_type": "markdown", "id": "1bdfb04f", "metadata": {}, "source": [ "The preceding line loads in several standard packages, including NumPy, SciPy, and matplotlib." ] }, { "cell_type": "markdown", "id": "ca0d7ca2", "metadata": {}, "source": [ "## Load the functions for extracting the \"Item\" sections from the forms\n", "\n", "We created various helper functions to enable sectioning the SEC forms. These functions do take around 1 minute to load." ] }, { "cell_type": "code", "execution_count": null, "id": "88da5076", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%run sec-dashboard/SEC_Section_Extraction_Functions.ipynb" ] }, { "cell_type": "markdown", "id": "e4578f5d", "metadata": {}, "source": [ "Next, we import ```smjsindustry``` package, as shown below." ] }, { "cell_type": "code", "execution_count": null, "id": "dc8072e3", "metadata": {}, "outputs": [], "source": [ "import smjsindustry\n", "from smjsindustry.finance import utils\n", "from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST\n", "from smjsindustry import NLPScorerConfig, JaccardSummarizerConfig, KMedoidsSummarizerConfig\n", "from smjsindustry import Summarizer, NLPScorer\n", "from smjsindustry.finance.processor import DataLoader, SECXMLFilingParser\n", "from smjsindustry.finance.processor_config import EDGARDataSetConfig" ] }, { "cell_type": "markdown", "id": "f1f4c968", "metadata": {}, "source": [ "## Download the filings you wish to work with\n", "\n", "Downloading SEC filings is done from the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website, which provides open data access. EDGAR is the primary system under the U.S. Securities And Exchange Commission (SEC) for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average. Below we provide a simple *one*-API call that will create a dataset of plain text filings in a few lines of code, for any period of time and for a large number of tickers. \n", "\n", "We have wrapped the extraction functionality into a SageMaker processing container and provide this notebook to enable users to download a dataset of filings with metadata such as dates and parsed plain text that can then be used for machine learning using other SageMaker tools. Users only need to specify a date range and a list of ticker symbols and this API will do the rest.\n", "\n", "The extracted dataframe is written to S3 storage and to the local notebook instance. " ] }, { "cell_type": "markdown", "id": "a4e016ec", "metadata": {}, "source": [ "The API below specifies the machine to be used and the volume size. It also specifies the tickers or CIK codes for the companies to be covered, as well as the 3 form types (10-K, 10-Q, 8-K) to be retrieved. The data range is also specified as well as the filename (CSV) where the retrieved filings will be stored. 
" ] }, { "cell_type": "markdown", "id": "464ff484", "metadata": {}, "source": [ "The API is in 3 parts: \n", "\n", "1. Set up a dataset configuration (an `EDGARDataSetConfig` object). This specifies (i) the tickers or SEC CIK codes for the companies whose forms are being extracted; (ii) the SEC forms types (in this case 10-K, 10-Q, 8-K); (iii) date range of forms by filing date, (iv) the output CSV file and S3 bucket to store the dataset. \n", "2. Set up a data loader object (a `DataLoader` object). The middle section shows how to assign system resources and has default values in place. \n", "3. Run the data loader (`data_loader.load`).\n", "\n", "This initiates a processing job running in a SageMaker container." ] }, { "cell_type": "markdown", "id": "cc48c4f1", "metadata": {}, "source": [ ">**Important**: \n", ">This example notebook uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR\u2019s access terms and conditions located in the [Accessing EDGAR Data](https://www.sec.gov/os/accessing-edgar-data) page." ] }, { "cell_type": "code", "execution_count": null, "id": "27924f47", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "\n", "dataset_config = EDGARDataSetConfig(\n", " tickers_or_ciks=[\n", " \"amzn\",\n", " \"goog\",\n", " \"27904\",\n", " \"fb\",\n", " \"msft\",\n", " \"uber\",\n", " \"nflx\",\n", " ], # list of stock tickers or CIKs\n", " form_types=[\"10-K\", \"10-Q\", \"8-K\"], # list of SEC form types\n", " filing_date_start=\"2019-01-01\", # starting filing date\n", " filing_date_end=\"2020-12-31\", # ending filing date\n", " email_as_user_agent=\"test-user@test.com\",\n", ") # user agent email\n", "\n", "data_loader = DataLoader(\n", " role=role, # loading job execution role\n", " instance_count=1, # instances number, limit varies with instance type\n", " instance_type=\"ml.c5.2xlarge\", # instance type\n", " volume_size_in_gb=30, # size in GB of the EBS volume to use\n", " volume_kms_key=None, # KMS key ID to encrypt the processing volume\n", " output_kms_key=None, # KMS key ID to encrypt processing job outputs\n", " max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours.\n", " sagemaker_session=session, # session object\n", " tags=None,\n", ") # a list of key-value pairs\n", "\n", "data_loader.load(\n", " dataset_config,\n", " \"s3://{}/{}/{}\".format(\n", " bucket, secdashboard_processed_folder, \"output\"\n", " ), # output s3 prefix (both bucket and folder names are required)\n", " \"dataset_10k_10q_8k_2019_2021.csv\", # output file name\n", " wait=True,\n", " logs=True,\n", ")" ] }, { "cell_type": "markdown", "id": "ee1382b0", "metadata": {}, "source": [ "## Copy the file into Studio from the s3 bucket\n", "\n", "We can examine the dataframe that was constructed by the API. " ] }, { "cell_type": "code", "execution_count": null, "id": "3e4c28af", "metadata": {}, "outputs": [], "source": [ "client = boto3.client(\"s3\")\n", "client.download_file(\n", " bucket,\n", " \"{}/{}/{}\".format(secdashboard_processed_folder, \"output\", \"dataset_10k_10q_8k_2019_2021.csv\"),\n", " \"dataset_10k_10q_8k_2019_2021.csv\",\n", ")" ] }, { "cell_type": "markdown", "id": "3b5e5528", "metadata": {}, "source": [ "See how a complete dataset was prepared. Altogether, a few hundred forms were retrieved across tickers and the three types of SEC form. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "836cce4c", "metadata": {}, "outputs": [], "source": [ "df_forms = pd.read_csv(\"dataset_10k_10q_8k_2019_2021.csv\")\n", "df_forms" ] }, { "cell_type": "markdown", "id": "ebd53c35", "metadata": {}, "source": [ "Here is a breakdown of the few hundred forms by **ticker** and **form_type**. " ] }, { "cell_type": "code", "execution_count": null, "id": "d57c0d85", "metadata": {}, "outputs": [], "source": [ "df_forms.groupby([\"ticker\", \"form_type\"]).count().reset_index()" ] }, { "cell_type": "markdown", "id": "37cad025", "metadata": {}, "source": [ "## Create the dataframe for the extracted item sections from the 10-K filings\n", "\n", "In this section, we break the various sections of the 10-K filings into separate columns of the extracted dataframe. \n", "\n", "1. Take a subset of the dataframe by specifying `df.form_type == \"10-K\"`.\n", "2. Extract the sections for each 10-K filing and put them in columns in a separate dataframe.\n", "3. Merge this dataframe with the dataframe from Step 1. \n", "\n", "You can examine the cells in the dataframe below to see the text from each section. " ] }, { "cell_type": "code", "execution_count": null, "id": "281e3c88", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"dataset_10k_10q_8k_2019_2021.csv\")\n", "df_10K = df[df.form_type == \"10-K\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "6b214a0f", "metadata": {}, "outputs": [], "source": [ "# Construct the DataFrame row by row.\n", "items_10K = pd.DataFrame(columns=columns_10K, dtype=object)\n", "for i in df_10K.index:\n", " form_text = df_10K.text[i]\n", " item_iter = get_form_items(form_text, \"10-K\")\n", " items_10K.loc[i] = items_to_df_row(item_iter, columns_10K, \"10-K\")" ] }, { "cell_type": "code", "execution_count": null, "id": "50664a87", "metadata": {}, "outputs": [], "source": [ "items_10K.rename(columns=header_mappings_10K, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "217fa3b2", "metadata": {}, "outputs": [], "source": [ "df_10K = pd.merge(df_10K, items_10K, left_index=True, right_index=True)\n", "df_10K.head(10)" ] }, { "cell_type": "markdown", "id": "bfb3ab22", "metadata": {}, "source": [ "Let's take a look at the text in one of the columns to see that there is clean, parsed, plain text provided by the API: " ] }, { "cell_type": "code", "execution_count": null, "id": "9d536797", "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(df_10K[\"Risk Factors\"][138])" ] }, { "cell_type": "markdown", "id": "daa6762b", "metadata": {}, "source": [ "## Similarly, we can create the dataframe for the extracted item sections from the 10-Q filings\n", "\n", "1. Take a subset of the dataframe by specifying `df.form_type == \"10-Q\"`.\n", "2. Extract the sections for each 10-Q filing and put them in columns in a separate dataframe.\n", "3. Merge this dataframe with the dataframe from 1. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "e8fd7496", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"dataset_10k_10q_8k_2019_2021.csv\")\n", "df_10Q = df[df.form_type == \"10-Q\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "11955d2a", "metadata": {}, "outputs": [], "source": [ "# Construct the DataFrame row by row.\n", "items_10Q = pd.DataFrame(columns=columns_10Q, dtype=object)\n", "for i in df_10Q.index:\n", " form_text = df_10Q.text[i]\n", " item_iter = get_form_items(form_text, \"10-Q\")\n", " items_10Q.loc[i] = items_to_df_row(item_iter, columns_10Q, \"10-Q\")" ] }, { "cell_type": "code", "execution_count": null, "id": "a2a0258e", "metadata": {}, "outputs": [], "source": [ "items_10Q.rename(columns=header_mappings_10Q, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "0ff26ca6", "metadata": {}, "outputs": [], "source": [ "df_10Q = pd.merge(df_10Q, items_10Q, left_index=True, right_index=True)\n", "df_10Q.head(10)" ] }, { "cell_type": "markdown", "id": "c4a0113c", "metadata": {}, "source": [ "## Create the dataframe for the extracted item sections from the 8-K filings\n", "\n", "1. Take a subset of the dataframe by specifying `df.form_type == \"8-K\"`.\n", "2. Extract the sections for each 8-K filing and put them in columns in a separate dataframe.\n", "3. Merge this dataframe with the dataframe from Step 1. " ] }, { "cell_type": "code", "execution_count": null, "id": "712e7b57", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"dataset_10k_10q_8k_2019_2021.csv\")\n", "df_8K = df[df.form_type == \"8-K\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "bcbaf82b", "metadata": {}, "outputs": [], "source": [ "# Construct the DataFrame row by row.\n", "items_8K = pd.DataFrame(columns=columns_8K, dtype=object)\n", "for i in df_8K.index:\n", " form_text = df_8K.text[i]\n", " item_iter = get_form_items(form_text, \"8-K\")\n", " items_8K.loc[i] = items_to_df_row(item_iter, columns_8K, \"8-K\")" ] }, { "cell_type": "code", "execution_count": null, "id": "951a0078", "metadata": {}, "outputs": [], "source": [ "items_8K.rename(columns=header_mappings_8K, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "52e3bdc1", "metadata": {}, "outputs": [], "source": [ "df_8K = pd.merge(df_8K, items_8K, left_index=True, right_index=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "b0118dfd", "metadata": {}, "outputs": [], "source": [ "df1 = df_8K.copy()\n", "df1 = df1.mask(df1.apply(lambda x: x.str.len().lt(1)))\n", "df1" ] }, { "cell_type": "markdown", "id": "6e313724", "metadata": {}, "source": [ "## Summary table of section counts" ] }, { "cell_type": "code", "execution_count": null, "id": "254a3b2b", "metadata": {}, "outputs": [], "source": [ "df1 = df1.groupby(\"ticker\").count()\n", "df1[df1.columns[5:]]" ] }, { "cell_type": "markdown", "id": "523f923d", "metadata": {}, "source": [ "## NLP scoring of the 10-K forms for specific sections\n", "\n", "Financial text has been scored using word lists for some time. See the paper [\"Textual Analysis in Finance\"](https://www.investopedia.com/terms/1/8-k.asp) for a comprehensive review. \n", "\n", "The `smjsindustry` library provides 11 NLP score types by default: `positive`, `negative`, `litigious`, `polarity`, `risk`, `readability`, `fraud`, `safe`, `certainty`, `uncertainty`, and `sentiment`. 
Each score (except readability and sentiment) has its own word list, which is used for scanning and matching against an input text dataset.\n", "\n", "NLP scoring delivers a score as the fraction of words in a document that are in the relevant scoring word lists. Users can provide their own custom word lists to calculate the NLP scores (a brief sketch follows the technical notes below). Some scores, like readability, use standard formulas such as the Gunning-Fog score. Sentiment scores are based on the [VADER](https://pypi.org/project/vaderSentiment/) library. \n", "\n", "These NLP scores are added as new numerical columns to the text dataframe; this creates a multimodal dataframe, which is a mixture of tabular data and longform text, called **TabText**. When submitting this multimodal dataframe for ML, it is a good idea to normalize the columns of NLP scores (usually with standard normalization or min-max scaling).\n", "\n", "Any chosen text column can be scored automatically using the tools in SageMaker JumpStart. We demonstrate this below. \n", "\n", "As an example, we combine the MD&A section (Item 7) and the Risk section (Item 7A), and then apply NLP scoring. We compute 11 additional columns for various types of scores. \n", "\n", "Since SEC filings text can be very large, NLP scoring is computationally time-consuming, so we have built the API to enable distribution of this task across multiple machines. In the API, users can choose the number and type of machine instances on which to run NLP scoring in distributed fashion. \n", "\n", "To begin, earmark the text for NLP scoring by creating a new column that combines two columns into a single column called `text2score`. A new file is saved in the Amazon S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "id": "e59aedf5", "metadata": {}, "outputs": [], "source": [ "df_10K[\"text2score\"] = [\n", " i + \" \" + j\n", " for i, j in zip(\n", " df_10K[\n", " \"Management\u2019s Discussion and Analysis of Financial Condition and Results of Operations\"\n", " ],\n", " df_10K[\"Quantitative and Qualitative Disclosures about Market Risk\"],\n", " )\n", "]\n", "df_10K[[\"ticker\", \"text2score\"]].to_csv(\"text2score.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "e94883cb", "metadata": {}, "outputs": [], "source": [ "client.upload_file(\n", " \"text2score.csv\",\n", " bucket,\n", " \"{}/{}/{}\".format(secdashboard_processed_folder, \"output\", \"text2score.csv\"),\n", ")" ] }, { "cell_type": "markdown", "id": "96746898", "metadata": {}, "source": [ "**Technical notes**:\n", "\n", "1. The NLPScorer sends SageMaker processing job requests to processing containers. It might take a few minutes to spin up a processing container. The actual processing starts after the initial spin-up. \n", "2. You are not charged for the waiting time used for the initial spin-up. \n", "3. You can run processing jobs on multiple instances.\n", "4. The name of the processing job is shown in the runtime log.\n", "5. You can also access the processing job from the [SageMaker console](https://console.aws.amazon.com/sagemaker). On the left navigation pane, choose **Processing**, then **Processing jobs**.\n", "6. NLP scoring can be slow for massive documents such as SEC filings, which contain anywhere from 20,000 to 100,000 words. Matching to word lists (usually 200 words or more) can be time-consuming. \n", "7. VPC mode is supported in this API.
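\n", "\n", "For instance, to score a custom attribute alongside the defaults, you can pass your own word list. A minimal sketch (with a made-up score name and word list) might look like this:\n", "\n", "```python\n", "from smjsindustry import NLPScorerConfig, NLPScoreType\n", "\n", "# Hypothetical custom attribute, scored as a word-list proportion\n", "supply_chain = NLPScoreType(\"supplychain\", [\"supplier\", \"shortage\", \"logistics\", \"inventory\"])\n", "custom_config = NLPScorerConfig([supply_chain])\n", "```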
\n", "\n", "**Input**\n", "\n", "The input to the API requires (i) the NLP scores to be generated, each one resulting in a new column in the dataframe; (ii) a specification of system resources, that is, the number and type of machine instances to be used; (iii) the S3 bucket and filename in which to store the enhanced dataframe as a CSV file; and (iv) the call that kicks off the processing job.\n", "\n", "**Output** \n", "\n", "The output filename used in the example below is `all_scores.csv`, but you can change this to any other filename. It's stored in the S3 bucket; then, as shown in the following code, we copy it into Studio to process it into a dashboard." ] }, { "cell_type": "code", "execution_count": null, "id": "2f55d005", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "\n", "import smjsindustry\n", "from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST\n", "from smjsindustry import NLPScorer\n", "from smjsindustry import NLPScorerConfig\n", "\n", "score_type_list = list(\n", " NLPScoreType(score_type, [])\n", " for score_type in NLPScoreType.DEFAULT_SCORE_TYPES\n", " if score_type not in NLPSCORE_NO_WORD_LIST\n", ")\n", "score_type_list.extend([NLPScoreType(score_type, None) for score_type in NLPSCORE_NO_WORD_LIST])\n", "\n", "nlp_scorer_config = NLPScorerConfig(score_type_list)\n", "\n", "nlp_score_processor = NLPScorer(\n", " role=role, # loading job execution role\n", " instance_count=1, # instances number, limit varies with instance type\n", " instance_type=\"ml.c5.9xlarge\", # ec2 instance type to run the loading job\n", " volume_size_in_gb=30, # size in GB of the EBS volume to use\n", " volume_kms_key=None, # KMS key ID to encrypt the processing volume\n", " output_kms_key=None, # KMS key ID to encrypt processing job outputs\n", " max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours.\n", " sagemaker_session=session, # session object\n", " tags=None, # a list of key-value pairs\n", ")\n", "\n", "nlp_score_processor.calculate(\n", " nlp_scorer_config,\n", " \"text2score\", # input column\n", " \"s3://{}/{}/{}/{}\".format(\n", " bucket, secdashboard_processed_folder, \"output\", \"text2score.csv\"\n", " ), # input from s3 bucket\n", " \"s3://{}/{}/{}\".format(\n", " bucket, secdashboard_processed_folder, \"output\"\n", " ), # output s3 prefix (both bucket and folder names are required)\n", " \"all_scores.csv\", # output file name\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "68fddc02", "metadata": {}, "outputs": [], "source": [ "client.download_file(\n", " bucket,\n", " \"{}/{}/{}\".format(secdashboard_processed_folder, \"output\", \"all_scores.csv\"),\n", " \"all_scores.csv\",\n", ")" ] }, { "cell_type": "markdown", "id": "bb489d14", "metadata": {}, "source": [ "## Stock Screener based on NLP scores\n", "\n", "Once we have added columns for all the NLP scores, we can screen the table for companies with high scores on any of the attributes. See the table below. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "cc582cb2", "metadata": {}, "outputs": [], "source": [ "qdf = pd.read_csv(\"all_scores.csv\")\n", "qdf.head()" ] }, { "cell_type": "markdown", "id": "9aeb9d53", "metadata": {}, "source": [ "## Add a column with summaries of the text being scored\n", "\n", "We can further enhance the dataframe with summaries of the target text column. As an example, we use the abstractive summarizer from Hugging Face. 
Since this summarizer can only accommodate roughly 300 words of text, it's not directly applicable to our text, which is much longer (thousands of words). Therefore, we apply the Hugging Face summarizer to groups of paragraphs and pull the results together to make a single summary. We created a helper function `fullSummary`, called in the code below, to create a summary of each document in the `text2score` column.\n", "\n", "Notice that the output dataframe is now extended with an additional summary column. \n", "\n", "*Note*: An abstractive summarizer restructures the text and loses the original sentences. This is in contrast to an extractive summarizer, which retains the original sentence structure. \n", "\n", "Summarization is time-consuming, so this code block takes a while to run. To illustrate, we summarize only the first five documents in the `text2score` column." ] }, { "cell_type": "code", "execution_count": null, "id": "1e959b7c", "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "qdf[\"summary\"] = \"\"\n", "for i in range(5):\n", " qdf.loc[i, \"summary\"] = fullSummary(qdf.loc[i, \"text2score\"])\n", " print(i, end=\"..\")" ] }, { "cell_type": "markdown", "id": "1997cb06", "metadata": {}, "source": [ "Examine one of the summaries. " ] }, { "cell_type": "code", "execution_count": null, "id": "5488f2fe", "metadata": { "scrolled": true }, "outputs": [], "source": [ "i = 2\n", "print(qdf.summary[i])\n", "print(\"---------------\")\n", "print(qdf.text2score[i])" ] }, { "cell_type": "markdown", "id": "e4058ec0", "metadata": {}, "source": [ "#### Store the curated dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "9f82d7f5", "metadata": {}, "outputs": [], "source": [ "qsf = qdf.drop([\"text2score\"], axis=1)\n", "qsf.to_csv(\"stock_sec_scores.csv\", index=False)" ] }, { "cell_type": "markdown", "id": "2d3ad4a2", "metadata": {}, "source": [ "To complete this example notebook, we provide two artifacts that may be included in a dashboard:\n", "1. An interactive data table that lets a non-technical user sort and filter the rows of the curated dataframe. \n", "2. Radar plots that visualize the differences between documents by their NLP scores. \n", "\n", "This is shown next. " ] }, { "cell_type": "markdown", "id": "23514437", "metadata": {}, "source": [ "## Create an interactive dashboard\n", "\n", "Using the generated CSV file, you can construct an interactive screening dashboard. \n", "\n", "The dashboard is constructed by an R script. All you need is the single block of code below. It creates a browser-enabled interactive data table and saves it in a file titled `SEC_Dashboard.html`, which you can open in a browser." ] }, { "cell_type": "markdown", "id": "c2bfb138", "metadata": {}, "source": [ "### Install `R`" ] }, { "cell_type": "code", "execution_count": null, "id": "1d85bbd9", "metadata": {}, "outputs": [], "source": [ "!sudo yum -y install R" ] }, { "cell_type": "code", "execution_count": null, "id": "fb42b2c2", "metadata": {}, "outputs": [], "source": [ "import subprocess\n", "\n", "ret_code = subprocess.call([\"/usr/bin/Rscript\", \"sec-dashboard/Dashboard.R\"])" ] }, { "cell_type": "markdown", "id": "3d603b19", "metadata": {}, "source": [ "After the notebook finishes running, open the `SEC_Dashboard.html` file that was created. You might need to click `Trust HTML` in the upper left corner to see the filterable table and its contents. The following screenshot shows an example of the filterable table."
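, "\n", "\n", "If you prefer to preview the dashboard without leaving the notebook, one option is to embed it in an IFrame (a sketch; inline HTML rendering support varies across JupyterLab and Studio versions):\n", "\n", "```python\n", "from IPython.display import IFrame\n", "\n", "IFrame(src=\"SEC_Dashboard.html\", width=900, height=500)\n", "```"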
] }, { "cell_type": "code", "execution_count": null, "id": "9a3cb49a", "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "\n", "Image(\"sec-dashboard/dashboard.png\", width=800, height=600)" ] }, { "cell_type": "markdown", "id": "cd8bd248", "metadata": {}, "source": [ "## Visualizing the text through the NLP scores\n", "\n", "The following visualization function shows how to create a *radar plot* to compare two SEC filings using their normalized NLP scores. The scores are normalized using min-max scaling on each NLP score. The radar plot is useful because it shows the overlap (and consequently, the difference) between the documents." ] }, { "cell_type": "code", "execution_count": null, "id": "d3e3b792", "metadata": {}, "outputs": [], "source": [ "## Read in the scores\n", "scores = pd.read_csv(\"stock_sec_scores.csv\")\n", "\n", "# Choose whichever filings you want to compare for the 2nd and 3rd parameter\n", "createRadarChart(scores, 2, 5)" ] }, { "cell_type": "markdown", "id": "e3db15dd", "metadata": {}, "source": [ "## Further support\n", "\n", "The [SEC filings retrieval API operations](https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/smjsindustry.finance.data_loader.html) we introduced at the beginning of this example notebook also download and parse other SEC forms, such as 495, 497, 497K, S-3ASR, and N-1A. If you need further support for any other types of finance documents, reach out to the SageMaker JumpStart team through [AWS Support](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker](https://forums.aws.amazon.com/forum.jspa?forumID=285).\n", "\n", "## References\n", "\n", "\n", "1. [What\u2019s New post](https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-sagemaker-jumpstart-multimodal-financial-analysis-tools/)\n", "\n", "\n", "2. Blogs:\n", "\n", " * [Use SEC text for ratings classification using multimodal ML in Amazon SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/use-sec-text-for-ratings-classification-using-multimodal-ml-in-amazon-sagemaker-jumpstart/)\n", " * [Use pre-trained financial language models for transfer learning in Amazon SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/use-pre-trained-financial-language-models-for-transfer-learning-in-amazon-sagemaker-jumpstart/)\n", "\n", "\n", "3. Documentation and links to the SageMaker JumpStart Industry Python SDK:\n", "\n", " * ReadTheDocs: https://sagemaker-jumpstart-industry-pack.readthedocs.io/en/latest/index.html\n", " * PyPI: https://pypi.org/project/smjsindustry/\n", " * GitHub Repository: https://github.com/aws/sagemaker-jumpstart-industry-pack/\n", " * Official SageMaker Developer Guide: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart-industry.html" ] }, { "cell_type": "markdown", "id": "af91d0b9", "metadata": {}, "source": [ "## Licenses\n", "\n", "The SageMaker JumpStart Industry product and its related materials are under the [Legal License Terms](https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/smfinance-notebook-dependency/licenses.txt)." ] }, { "cell_type": "markdown", "id": "bede6bbd", "metadata": {}, "source": [ ">**Important**: \n", ">(1) This notebook is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. (2) This notebook uses data obtained from the SEC EDGAR database. 
You are responsible for complying with EDGAR\u2019s [access terms and conditions](https://www.sec.gov/os/accessing-edgar-data)." ] }, { "cell_type": "markdown", "id": "6c373cac", "metadata": {}, "source": [ "This notebook utilizes certain third-party open source software packages at install-time or run-time (\u201cExternal Dependencies\u201d) that are subject to copyleft license terms you must accept in order to use it. If you do not accept all the applicable license terms, you should not use the notebook. We recommend that you consult your company\u2019s open source approval policy before proceeding.\n", "Provided below is a list of External Dependencies and the applicable license identification as indicated by the documentation associated with the External Dependencies as of Amazon\u2019s most recent review.\n", "- R v3.5.2: GPLv3 license (https://www.gnu.org/licenses/gpl-3.0.html)\n", "- DT v0.19.1: GPLv3 license (https://github.com/rstudio/DT/blob/master/LICENSE)\n", "\n", "THIS INFORMATION IS PROVIDED FOR CONVENIENCE ONLY. AMAZON DOES NOT PROMISE THAT\n", "THE LIST OR THE APPLICABLE TERMS AND CONDITIONS ARE COMPLETE, ACCURATE, OR\n", "UP-TO-DATE, AND AMAZON WILL HAVE NO LIABILITY FOR ANY INACCURACIES. YOU SHOULD\n", "CONSULT THE DOWNLOAD SITES FOR THE EXTERNAL DEPENDENCIES FOR THE MOST COMPLETE\n", "AND UP-TO-DATE LICENSING INFORMATION.\n", "\n", "YOUR USE OF THE EXTERNAL DEPENDENCIES IS AT YOUR SOLE RISK. IN NO EVENT WILL\n", "AMAZON BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT,\n", "INDIRECT, CONSEQUENTIAL, SPECIAL, INCIDENTAL, OR PUNITIVE DAMAGES (INCLUDING\n", "FOR ANY LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, OR\n", "COMPUTER FAILURE OR MALFUNCTION) ARISING FROM OR RELATING TO THE EXTERNAL\n", "DEPENDENCIES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, EVEN\n", "IF AMAZON HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS\n", "AND DISCLAIMERS APPLY EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-jumpstart|nlp_score_dashboard_sec|Dashboard_SEC_Filings.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.g4dn.2xlarge", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }