{
"cells": [
{
"cell_type": "markdown",
"id": "085d4dc9",
"metadata": {},
"source": [
"# Simple Construction of a Multimodal Dataset from SEC Filings and NLP Scores\n"
]
},
{
"cell_type": "markdown",
"id": "9a63d481",
"metadata": {},
"source": [
"\n",
"Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK makes it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks, including PyTorch and TensorFlow.\n",
"\n",
"This notebook shows how to use Amazon SageMaker to deploy a simple solution to retrieve U.S. Securities and Exchange Commission (SEC) filings and construct a dataframe of mixed tabular and text data, called TabText. This is a first step in multimodal machine learning. "
]
},
{
"cell_type": "markdown",
"id": "f365c16e",
"metadata": {},
"source": [
">**Important**: \n",
">This example notebook is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice."
]
},
{
"cell_type": "markdown",
"id": "d2a2bf23",
"metadata": {},
"source": [
"## Why SEC Filings? \n",
"\n",
"Financial NLP is a subset of the rapidly increasing use of ML in finance, but it is the largest; for more information, see this [survey paper](https://arxiv.org/abs/2002.05786). The starting point for a vast amount of financial natural language processing (NLP) is text in SEC filings. The SEC requires companies to report different types of information related to various events involving companies. To find the full list of SEC forms, see [Forms List](https://www.sec.gov/forms) in the *Securities and Exchange Commission (SEC) website*.\n",
"\n",
"SEC filings are widely used by financial services companies as a source of information about companies. Financial services companies may use this information as part of trading, lending, investment, and risk management decisions. Because these filings are required, they are of high quality. They contain forward-looking information that helps with forecasts and are written with a view to the future. In addition, in recent times, the value of historical time-series data has degraded, because economies have been structurally transformed by trade wars, pandemics, and political upheavals. Therefore, text as a source of forward-looking information has been increasing in relevance. \n",
"\n",
"There has been an exponential growth in downloads of SEC filings. To find out more, see [\"How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI\"]( https://www.nber.org/papers/w27950). This paper reports that the number of machine downloads of corporate 10-K and 10-Q filings increased from 360,861 in 2003 to 165,318,719 in 2016. \n",
"\n",
"There is a vast body of academic and practitioner research that is based on financial text, a significant portion of which is based on SEC filings. A recent review article summarizing this work is [\"Textual Analysis in Finance (2020)\"](https://www.annualreviews.org/doi/abs/10.1146/annurev-financial-012820-032249). "
]
},
{
"cell_type": "markdown",
"id": "b4212d04",
"metadata": {},
"source": [
"## What Does SageMaker Do? \n",
"\n",
"SEC filings are downloaded from the [SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website](https://www.sec.gov/edgar/search-and-access), which provides open data access. EDGAR is the primary system under the SEC for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average.\n",
"\n",
"There are several ways to download the data, and some open source packages available to extract the text from these filings. However, these require extensive programming and are not always easy to use. Following, you can find a simple *one*-API call that will create a dataset in a few lines of code, for any period of time and many tickers. \n",
"\n",
"This SageMaker JumpStart Industry example notebook wraps the extraction functionality into a SageMaker processing container. This notebook also provides code samples that enable users to download a dataset of filings with metadata, such as dates and parsed plain text that can then be used for machine learning using other SageMaker tools. You only need to specify a date range and a list of ticker symbols (or Central Index Key codes (CIK) codes, which are the SEC assigned identifier). This notebook does the rest. \n",
"\n",
"Currently, this solution supports extracting a popular subset of SEC forms in plain text (excluding tables). These are 10-K, 10-Q, 8-K, 497, 497K, S-3ASR and N-1A. For each of these, you can find examples following and a brief description of each form. For the 10-K and 10-Q forms, filed every year or quarter, the solution also extracts the Management Discussion and Analysis (MD&A) section, which is the primary forward-looking section in the filing. This section is the one most widely used in financial text analysis. This information is provided automatically in a separate column of the dataframe alongside the full text of the filing. \n",
"\n",
"The extracted dataframe is written to Amazon S3 storage and to the local notebook instance. "
]
},
{
"cell_type": "markdown",
"id": "cceb9811",
"metadata": {},
"source": [
"## Security Requirements"
]
},
{
"cell_type": "markdown",
"id": "03278805",
"metadata": {},
"source": [
"We provide a client library named SageMaker JumpStart Industry Python SDK (`smjsindustry`). The library provides the capability of running processing containers in customers’ Amazon virtual private cloud (VPC). More specifically, when calling `smjsindustry` API operations, customers can specify their VPC configurations such as `subnet-id` and `security-group-id`. SageMaker will launch `smjsindustry` processing containers in the VPC implied by the subnets. The inter-container traffic is specified by the security groups.\n",
" \n",
"Customers can also secure data at rest using their own AWS KMS keys. The `smjsindustry` package encrypts EBS volumes and S3 data if users passes the AWS KMS keys information to the `volume_kms_key` and `output_kms_key` arguments of a SageMaker processor."
]
},
{
"cell_type": "markdown",
"id": "73b8c9e4",
"metadata": {},
"source": [
"## General Steps\n",
"\n",
"This notebook takes the following steps to showcase the APIs from ```smjsindustry``` package:\n",
"\n",
"1. Demonstrate examples to synthetically generated SEC Forms SEC Forms Retrieval.\n",
"2. Demonstrate examples to use SEC Filing Parser to parse the raw file and to generate clear and structured text.\n",
"3. Demonstrate examples to use two text summarizers: JaccardSummarizer and KMedoidsSummarizer from SEC Filing Summarizer.\n",
"3. Demonstrate examples to use SEC Filing NLP Scoring with 11 NLP score types.\n",
"\n",
"**Note**: You can also access this notebook through SageMaker JumpStart that is executable on SageMaker Studio. For more information, see [Amazon SageMaker JumpStart Industry](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart-industry.html) in the **Amazon SageMaker Developer Guide**."
]
},
{
"cell_type": "markdown",
"id": "edd9783e",
"metadata": {},
"source": [
"## SageMaker Notebook Kernel Setup\n",
"\n",
"Recommended kernel is **conda_python3**.\n",
"For the instance type, using a larger instance with sufficient memory can be helpful to download the following materials."
]
},
{
"cell_type": "markdown",
"id": "f27ca1b9",
"metadata": {},
"source": [
"## Load Data, SDK, and Dependencies"
]
},
{
"cell_type": "markdown",
"id": "edd278c1",
"metadata": {},
"source": [
"First, we import required packages and load the S3 bucket from SageMaker session, as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18846b66",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import pandas as pd\n",
"import sagemaker"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4684a68a",
"metadata": {},
"outputs": [],
"source": [
"# Prepare the SageMaker session's default S3 bucket and a folder to store processed data\n",
"session = sagemaker.Session()\n",
"region = session._region_name\n",
"bucket = session.default_bucket()\n",
"role = sagemaker.get_execution_role()\n",
"sec_processed_folder = \"jumpstart_industry_sec_processed\""
]
},
{
"cell_type": "markdown",
"id": "4e88d585",
"metadata": {},
"source": [
"The following code cells download the `smjsindustry` SDK, dependencies, and dataset from an S3 bucket prepared by SageMaker JumpStart Industry. You will learn how to use the `smjsindustry` SDK which contains various APIs to curate SEC datasets. The dataset in this example was synthetically generated using the `smjsindustry` package's SEC Forms Retrieval tool."
]
},
{
"cell_type": "markdown",
"id": "87486a34",
"metadata": {},
"source": [
"### Install the `smjsindustry` library\n",
"\n",
"We deliver APIs through the `smjsindustry` client library. The first step requires pip installing a Python package that interacts with a SageMaker processing container. The retrieval, parsing, transforming, and scoring of text is a complex process and uses many different algorithms and packages. To make this seamless and stable for the user, the functionality is packaged into an S3 bucket. For installation and maintenance of the workflow, this approach reduces your effort to a pip install followed by a single API call.\n",
"\n",
"The following code blocks copy the wheel file to install the `smjsindustry` library. It also downloads a synthetic dataset and dependencies to demonstrate the functionality of curating the TabText dataframe. **Please make sure the IAM role attached to your Notebook instance has the AmazonS3FullAccess permission policy.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06fab951",
"metadata": {},
"outputs": [],
"source": [
"notebook_artifact_bucket = f\"jumpstart-cache-prod-{region}\"\n",
"notebook_data_prefix = \"smfinance-notebook-data/smjsindustry-tutorial\"\n",
"notebook_sdk_prefix = \"smfinance-notebook-dependency/smjsindustry\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f882804d",
"metadata": {},
"outputs": [],
"source": [
"# Download example dataset\n",
"data_bucket = f\"s3://{notebook_artifact_bucket}/{notebook_data_prefix}\"\n",
"! aws s3 sync $data_bucket ./"
]
},
{
"cell_type": "markdown",
"id": "4ef8fdc4",
"metadata": {},
"source": [
"Install the `smjsindustry` library and dependencies by running the following code block; the packages are needed for machine learning but aren't available as defaults in SageMaker Notebook instance."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47f5963a",
"metadata": {},
"outputs": [],
"source": [
"# Install smjsindustry SDK\n",
"sdk_bucket = f\"s3://{notebook_artifact_bucket}/{notebook_sdk_prefix}\"\n",
"!aws s3 sync $sdk_bucket ./\n",
"\n",
"!pip install --no-index smjsindustry-1.0.0-py3-none-any.whl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1db85f6",
"metadata": {},
"outputs": [],
"source": [
"%pylab inline"
]
},
{
"cell_type": "markdown",
"id": "1f3deb1d",
"metadata": {},
"source": [
"The preceding line loads in several standard packages, including NumPy, SciPy, and matplotlib."
]
},
{
"cell_type": "markdown",
"id": "0a50b0fe",
"metadata": {},
"source": [
"Next, we import ```smjsindustry``` package, as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "552957ed",
"metadata": {},
"outputs": [],
"source": [
"import smjsindustry\n",
"from smjsindustry.finance import utils\n",
"from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST\n",
"from smjsindustry import NLPScorerConfig, JaccardSummarizerConfig, KMedoidsSummarizerConfig\n",
"from smjsindustry import Summarizer, NLPScorer\n",
"from smjsindustry.finance.processor import DataLoader, SECXMLFilingParser\n",
"from smjsindustry.finance.processor_config import EDGARDataSetConfig"
]
},
{
"cell_type": "markdown",
"id": "b0168e3b",
"metadata": {},
"source": [
"Next, you can find examples of how to extract the different forms."
]
},
{
"cell_type": "markdown",
"id": "3cb30f6b",
"metadata": {},
"source": [
"## SEC Filing Retrieval "
]
},
{
"cell_type": "markdown",
"id": "19501ff1",
"metadata": {},
"source": [
"### SEC Forms 10-K/10-Q\n",
"\n",
"10-K/10-Q forms are quarterly reports required to be filed by companies. They contain full disclosure of business conditions for the company and also require forward-looking statements of future prospects, usually written into a section known as the \"Management Discussion & Analysis\" section. There also can be a section called \"Forward-Looking Statements\". For more information, see [Form 10-K](https://www.investor.gov/introduction-investing/investing-basics/glossary/form-10-k) in the *Investor.gov webpage*.\n",
"\n",
"Each year firms file three 10-Q forms (quarterly reports) and one 10-K (annual report). Thus, there are in total four reports each year. The structure of the forms is displayed in a table of contents. "
]
},
{
"cell_type": "markdown",
"id": "778f010a",
"metadata": {},
"source": [
"The SEC filing retrieval supports the downloading and parsing of 10-K, 10-Q, 8-K, 497, 497K, S-3ASR and N-1A, seven form types for the tickers or CIKs specified by the user. The following block of code will download full text of the forms and convert it into a dataframe format using a SageMaker session. The code is self-explanatory, and offers customized options to the users. \n",
"\n",
"**Technical notes**:\n",
"\n",
"1. The data loader accesses a container to process the request. There might be some latency when starting up the container, which accounts for a few initial minutes. The actual filings extraction occurs after this. \n",
"2. The data loader only supports processing jobs with only one instance at the moment.\n",
"3. Users are not charged for the waiting time used when the instance is initializing (this takes 3-5 minutes). \n",
"4. The name of the processing job is shown in the run time log. \n",
"5. You can also access the processing job from the [SageMaker console](https://console.aws.amazon.com/sagemaker). On the left navigation pane, choose Processing, Processing job.\n",
"\n",
"\n",
"Users may update any of the settings in the `data_loader` section of the code block below, and in the `dataset_config` section. For a very long list of tickers or CIKs, the job will run for a while, and the `...` stream will indicate activity as it proceeds. \n",
"\n",
"**NOTE**: We recommend that you use CIKs as the input. The tickers are internally converted to CIKs according to this [mapping file](https://www.sec.gov/include/ticker.txt). \n",
"One ticker can map to multiple CIKs, but this solution supports only the latest ticker to CIK mapping. Make sure to provide the old CIKs in the input when you want historical filings."
]
},
{
"cell_type": "markdown",
"id": "209589c3",
"metadata": {},
"source": [
"The following code block shows how to use the SEC Retriever API. You specify system resources (or just choose the defaults below). Also specify the tickers needed, the SEC forms needed, the date range, and the location and name of the file in S3 where the curated data file will be stored in CSV format. The output will show the runtime log from the SageMaker processing container and indicates when it is completed. \n",
"\n",
">**Important**: \n",
">This example notebook uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions located in the [Accessing EDGAR Data](https://www.sec.gov/os/accessing-edgar-data) page."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfaad96a",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"dataset_config = EDGARDataSetConfig(\n",
" tickers_or_ciks=[\"amzn\", \"goog\", \"27904\", \"FB\"], # list of stock tickers or CIKs\n",
" form_types=[\"10-K\", \"10-Q\"], # list of SEC form types\n",
" filing_date_start=\"2019-01-01\", # starting filing date\n",
" filing_date_end=\"2020-12-31\", # ending filing date\n",
" email_as_user_agent=\"test-user@test.com\",\n",
") # user agent email\n",
"\n",
"data_loader = DataLoader(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours.\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
") # a list of key-value pairs\n",
"\n",
"data_loader.load(\n",
" dataset_config,\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"dataset_10k_10q.csv\", # output file name\n",
" wait=True,\n",
" logs=True,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c8cf2456",
"metadata": {},
"source": [
"#### Output\n",
"\n",
"The output of the DataLoader processing job is a dataframe. This job includes 32 filings (4 companies for 8 quarters). The CSV file is downloaded from S3 and then read into a dataframe, as shown in the following few code blocks.\n",
"\n",
"The filing date comes within a month of the end date of the reporting period. The filing date is displayed in the dataframe. The column `\"text\"` contains the full plain text of the filing but the tables are not extracted. The values in the tables in the filings are balance-sheet and income-statement data (numeric and tabular) and are easily available elsewhere as they are reported in numeric databases. The last column (`\"mdna\"`) of the dataframe comprises the Management Discussion & Analysis section, which is also included in the `\"text\"` column. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79a4116d",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"dataset_10k_10q.csv\"),\n",
" \"dataset_10k_10q.csv\",\n",
")\n",
"data_frame_10k_10q = pd.read_csv(\"dataset_10k_10q.csv\")\n",
"data_frame_10k_10q"
]
},
{
"cell_type": "markdown",
"id": "f816be4d",
"metadata": {},
"source": [
"As an example of a clean parse, print out the text of the first filing. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0749c3ef",
"metadata": {},
"outputs": [],
"source": [
"print(data_frame_10k_10q.text[0])"
]
},
{
"cell_type": "markdown",
"id": "94f5f85f",
"metadata": {},
"source": [
"To read the MD&A section, use the following code to print out the section for the second filing in the dataframe. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af8e54ff",
"metadata": {},
"outputs": [],
"source": [
"print(data_frame_10k_10q.mdna[1])"
]
},
{
"cell_type": "markdown",
"id": "2fa07a01",
"metadata": {},
"source": [
"### SEC Form 8-K \n",
"\n",
"This form is filed for material changes in business conditions. This [Form 8-K page](https://www.sec.gov/fast-answers/answersform8khtm.html) describes the form requirements and various conditions for publishing an 8-K filing. Because there is no set cadence to these filings, several 8-K forms might be filed within a year, depending on how often a company experiences material changes in business conditions. \n",
"\n",
"The API call below is the same as for the 10-K forms; simply change the form type `8-K` to `10-K`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8476235",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"dataset_config = EDGARDataSetConfig(\n",
" tickers_or_ciks=[\"amzn\", \"goog\", \"27904\", \"FB\"], # list of stock tickers or CIKs\n",
" form_types=[\"8-K\"], # list of SEC form types\n",
" filing_date_start=\"2019-01-01\", # starting filing date\n",
" filing_date_end=\"2020-12-31\", # ending filing date\n",
" email_as_user_agent=\"test-user@test.com\",\n",
") # user agent email\n",
"\n",
"data_loader = DataLoader(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours.\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
") # a list of key-value pairs\n",
"\n",
"data_loader.load(\n",
" dataset_config,\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"dataset_8k.csv\", # output file name\n",
" wait=True,\n",
" logs=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51c53705",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket, \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"dataset_8k.csv\"), \"dataset_8k.csv\"\n",
")\n",
"data_frame_8k = pd.read_csv(\"dataset_8k.csv\")\n",
"data_frame_8k"
]
},
{
"cell_type": "markdown",
"id": "4bfc6f35",
"metadata": {},
"source": [
"As noted, 8-K forms do not have a fixed cadence, and they depend on the number of times a company changes the material. Therefore, the number of forms varies over time. \n",
"\n",
"Next, print the plain text of the first 8-K form in the dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c23e98fd",
"metadata": {},
"outputs": [],
"source": [
"print(data_frame_8k.text[0])"
]
},
{
"cell_type": "markdown",
"id": "a0f71f95",
"metadata": {},
"source": [
"### Other SEC Forms \n",
"\n",
"We also support SEC forms 497, 497K, S-3ASR, N-1A, 485BXT, 485BPOS, 485APOS, S-3, S-3/A, DEF 14A, SC 13D and SC 13D/A. \n",
"\n",
"#### SEC Form 497\n",
"\n",
"Mutual funds are required to file Form 497 to disclose any information that is material for investors. Funds file their prospectuses using this form as well as proxy statements. The form is also used for Statements of Additional Information (SAI). The forward-looking information in Form 497 comprises the detailed company history, financial statements, a description of products and services, an annual review of the organization, its operations, and the markets in which the company operates. Much of this data is usually audited so is of high quality. For more information, see [SEC Form 497](https://www.investopedia.com/terms/s/sec-form-497.asp). \n",
" \n",
"#### SEC Form 497K \n",
"This is a summary prospectus. It describes the fees and expenses of the fund, its principal investment strategies, principal risks, past performance if any, and some administrative information. Many such forms are filed for example, in Q4 of 2020 a total of 5,848 forms of type 497K were filed. \n",
"\n",
"#### SEC Form S-3ASR \n",
"The S-3ASR is an automatic shelf registration statement which is immediately effective upon filing for use by well-known seasoned issuers to register unspecified amounts of different specified types of securities. This Registration Statement is for the registration of securities under the Securities Act of 1933.\n",
"\n",
"#### SEC Form N-1A \n",
"This registration form is required for establishing open-end management companies. The form can be used for registering both open-end mutual funds and open-end exchange traded funds (ETFs). For more information, see [SEC Form N-1A](https://www.investopedia.com/terms/s/sec-form-n-1a.asp)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7a381c8",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"dataset_config = EDGARDataSetConfig(\n",
" tickers_or_ciks=[\n",
" \"zm\",\n",
" \"709364\",\n",
" \"1829774\",\n",
" ], # list of stock tickers or CIKs, 709364 is the CIK for ROYCE FUND and 1829774 is the CIK for James Alpha Funds Trust\n",
" form_types=[\"497\", \"497K\", \"S-3ASR\", \"N-1A\"], # list of SEC form types\n",
" filing_date_start=\"2021-01-01\", # starting filing date\n",
" filing_date_end=\"2021-02-01\", # ending filing date\n",
" email_as_user_agent=\"test-user@test.com\",\n",
") # user agent email\n",
"\n",
"data_loader = DataLoader(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours.\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
") # a list of key-value pairs\n",
"\n",
"data_loader.load(\n",
" dataset_config,\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"dataset_other_forms.csv\", # output file name\n",
" wait=True,\n",
" logs=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa9a85b2",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"dataset_other_forms.csv\"),\n",
" \"dataset_other_forms.csv\",\n",
")\n",
"data_frame_other_forms = pd.read_csv(\"dataset_other_forms.csv\")\n",
"data_frame_other_forms"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2183a37f",
"metadata": {},
"outputs": [],
"source": [
"# Example of 497 form\n",
"print(data_frame_other_forms.text[2])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55876546",
"metadata": {},
"outputs": [],
"source": [
"# Example of 497K form\n",
"print(data_frame_other_forms.text[4])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee568b2f",
"metadata": {},
"outputs": [],
"source": [
"# Example of S-3ASR form\n",
"print(data_frame_other_forms.text[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e876336b",
"metadata": {},
"outputs": [],
"source": [
"# Example of N-1A form\n",
"print(data_frame_other_forms.text[1])"
]
},
{
"cell_type": "markdown",
"id": "e1eb68f1",
"metadata": {},
"source": [
"## SEC Filing Parser"
]
},
{
"cell_type": "markdown",
"id": "5df48ebd",
"metadata": {},
"source": [
"If you have the SEC filings ready locally or in an S3 bucket, you can use the SEC Filing Parser API to parse the raw file and to generate clear and structured text."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddbbee86",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"parser = SECXMLFilingParser(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" sagemaker_session=session, # session object\n",
")\n",
"parser.parse(\n",
" \"xml\", # local input folder or S3 path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc3985fb",
"metadata": {},
"outputs": [],
"source": [
"xml_file_name = [\"0001018724-21-000002.txt\", \"0001018724-21-000004.txt\"]\n",
"parsed_file_name = [\"parsed-\" + name for name in xml_file_name]\n",
"\n",
"client = boto3.client(\"s3\")\n",
"for file in parsed_file_name:\n",
" client.download_file(bucket, \"{}/{}/{}\".format(sec_processed_folder, \"output\", file), file)\n",
"\n",
"parsed_res = open(parsed_file_name[0], \"r\")\n",
"print(parsed_res.read())"
]
},
{
"cell_type": "markdown",
"id": "52b08702",
"metadata": {},
"source": [
"## SEC Filing Summarizer \n",
"\n",
"The `smjsindustry` Python SDK provides two text summarizers that extracts concise summaries while preserving key information and overall meaning. `JaccardSummarizer` and `KMedoidsSummarizer` are the text summarizers adopted to the `smjsindustry` Python SDK. \n",
"\n",
"You can configure a `JaccardSummarizer` processor or a `KMedoidsSummarizer` processor using the `smjsindustry` library, and run a processing job using the SageMaker Python SDK. To achieve better performance and reduced training time, the processing job can be initiated with multiple instances. \n",
"\n",
"**Technical Notes**:\n",
"\n",
"1. The summarizers send SageMaker processing job requests to processing containers. It might take a few minutes when spinning up a processing container. The actual filings extraction start after the initial spin-up. \n",
"2. You are not charged for the waiting time used for the initial spin-up. \n",
"3. You can run processing jobs in multiple instances.\n",
"4. The name of the processing job is shown in the runtime log. \n",
"5. You can also access the processing job from the [SageMaker console](https://console.aws.amazon.com/sagemaker). On the left navigation pane, choose Processing, Processing job.\n",
"6. VPC mode is supported for the summarizers."
]
},
{
"cell_type": "markdown",
"id": "875acbeb",
"metadata": {},
"source": [
"### Jaccard Summarizer \n",
"The Jaccard summarizer uses the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index). It provides the main theme of a document by extracting the sentences with the greatest similarity among all sentences. The metric calculates the number of common words between two sentences normalized by the size of the superset of the words in the two sentences. \n",
"\n",
"You can use the `summary_size`, `summary_percentage`, `max_tokens`, and `cutoff` parameters to limit the size of the docs to be summarized (see **Example 1**). \n",
"\n",
"You can also provide your own vocabulary to calculate Jaccard similarities between sentences (see **Example 2**). \n",
"\n",
"The Jaccard summarizer is an extractive summarizer (not abstractive). There are two main reasons for adopting this extractive summarizer:\n",
"- One, the extractive approach retains the original sentences and thus preserves the legal meaning of the sentences. \n",
"- Two, it works fast on very long text as we have in SEC filings. Long text is not easily handled by abstractive summarizers that are based on embeddings from transformers that can ingest a limited number of words. \n",
"\n",
"**Two examples** are shown below:\n",
"- In **Example 1**, JaccardSummarizer for the `dataset_10k_10q_sample.csv'` data (created by data loader) runs against the `'mdna'` column, resulting in a summary of 10% of the original text length. \n",
"- In **Example 2**, JaccardSummarizer for the `'dataset_10k_10q_sample.csv'` data (created by data loader) runs against the `'mdna'` column. The summarizer uses the `custom_vocabulary` list set, which is the union of the customized positive and negative word lists. This creates summary of sentences containing more positive and negative connotations. "
]
},
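{
"cell_type": "markdown",
"id": "5e1a0c71",
"metadata": {},
"source": [
"The Jaccard similarity described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the `smjsindustry` implementation: the helper names `tokenize`, `jaccard_similarity`, and `top_sentences` are hypothetical, the tokenizer keeps lowercase alphabetic words only, and each sentence is ranked by its total similarity to all the other sentences.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"\n",
"def tokenize(sentence):\n",
"    # Lowercase and keep alphabetic runs only (an assumption of this sketch).\n",
"    return set(re.findall('[a-z]+', sentence.lower()))\n",
"\n",
"\n",
"def jaccard_similarity(sentence_a, sentence_b):\n",
"    # Size of the intersection divided by the size of the union of the word sets.\n",
"    words_a, words_b = tokenize(sentence_a), tokenize(sentence_b)\n",
"    union = words_a | words_b\n",
"    return len(words_a & words_b) / len(union) if union else 0.0\n",
"\n",
"\n",
"def top_sentences(sentences, summary_size):\n",
"    # Score each sentence by its total similarity to every other sentence.\n",
"    totals = [\n",
"        sum(jaccard_similarity(s, other) for j, other in enumerate(sentences) if j != i)\n",
"        for i, s in enumerate(sentences)\n",
"    ]\n",
"    ranked = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)\n",
"    # Keep the highest-scoring sentences, restored to document order.\n",
"    return [sentences[i] for i in sorted(ranked[:summary_size])]\n",
"```\n",
"\n",
"For example, `jaccard_similarity('revenue increased this quarter', 'revenue decreased this quarter')` evaluates to 0.6, because three of the five distinct words are shared."
]
},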
{
"cell_type": "markdown",
"id": "07eafc16",
"metadata": {},
"source": [
"#### Example 1"
]
},
{
"cell_type": "markdown",
"id": "964e575d",
"metadata": {},
"source": [
"For demonstration purposes, take a sample from the original dataset to reduce the processing time."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88a8948b",
"metadata": {},
"outputs": [],
"source": [
"data_frame_10k_10q_sample = pd.concat(\n",
" [\n",
" data_frame_10k_10q[data_frame_10k_10q[\"form_type\"] == \"10-K\"].sample(n=1),\n",
" data_frame_10k_10q[data_frame_10k_10q[\"form_type\"] == \"10-Q\"].sample(n=1),\n",
" ]\n",
").sample(frac=1)\n",
"data_frame_10k_10q_sample.to_csv(\"dataset_10k_10q_sample.csv\", index=False)\n",
"data_frame_10k_10q_sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8d8d841",
"metadata": {},
"outputs": [],
"source": [
"data_frame_8k_sample = data_frame_8k.sample(n=2)\n",
"data_frame_8k_sample.to_csv(\"dataset_8k_sample.csv\", index=False)\n",
"data_frame_8k_sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10a25c84",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"jaccard_summarizer_config = JaccardSummarizerConfig(summary_percentage=0.1)\n",
"\n",
"jaccard_summarizer = Summarizer(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" sagemaker_session=session,\n",
") # session object\n",
"\n",
"jaccard_summarizer.summarize(\n",
" jaccard_summarizer_config,\n",
" \"mdna\", # mdna column name\n",
" \"./dataset_10k_10q_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"Jaccard_Summaries.csv\", # output file name\n",
" new_summary_column_name=\"summary\",\n",
") # add column \"summary\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44104a3a",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"Jaccard_Summaries.csv\"),\n",
" \"Jaccard_Summaries.csv\",\n",
")\n",
"Jaccard_summaries = pd.read_csv(\"Jaccard_Summaries.csv\")\n",
"Jaccard_summaries.head()"
]
},
{
"cell_type": "markdown",
"id": "9100ccbf",
"metadata": {},
"source": [
"#### Example 2"
]
},
{
"cell_type": "markdown",
"id": "c5fb46e2",
"metadata": {},
"source": [
"Here is the second example, focusing on summaries with sentences containing more positive and negative words. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf2e6f09",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"positive_word_list = pd.read_csv(\"positive_words.csv\")\n",
"negative_word_list = pd.read_csv(\"negative_words.csv\")\n",
"custom_vocabulary = set(list(positive_word_list) + list(negative_word_list))\n",
"\n",
"jaccard_summarizer_config = JaccardSummarizerConfig(\n",
" summary_percentage=0.1, vocabulary=custom_vocabulary\n",
")\n",
"\n",
"jaccard_summarizer = Summarizer(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" sagemaker_session=session,\n",
") # session object\n",
"\n",
"jaccard_summarizer.summarize(\n",
" jaccard_summarizer_config,\n",
" \"mdna\", # mdna column name\n",
" \"./dataset_10k_10q_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"Jaccard_Summaries_pos_neg.csv\", # output file name\n",
" new_summary_column_name=\"summary\",\n",
") # add column \"summary\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9900a78e",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"Jaccard_Summaries_pos_neg.csv\"),\n",
" \"Jaccard_Summaries_pos_neg.csv\",\n",
")\n",
"Jaccard_summaries = pd.read_csv(\"Jaccard_Summaries_pos_neg.csv\")\n",
"Jaccard_summaries.head()"
]
},
{
"cell_type": "markdown",
"id": "09724118",
"metadata": {},
"source": [
"### KMedoids Summarizer\n",
"\n",
"The k-medoids summarizer clusters sentences and produces the medoid of each cluster as the summary. You can calculate the distance for clustering by choosing one of the following distance metrics: `'euclidean'`, `'cosine'`, or `'dot-product'`. Medoid initialization methods include `'random'`, `'heuristic'`, `'k-medoids++'`, and `'build'`. Pass these options to the k-medoids summarizer configuration (`KMedoidsSummarizerConfig`) in the first line of the following code block. Available options are:\n",
"- For `metric`, `{'euclidean', 'cosine', 'dot-product'}`\n",
"- For `init`, `{'random', 'heuristic', 'k-medoids++', 'build'}`\n",
"\n",
"The size of the summary is specified as the number of sentences needed in the summary. \n",
"\n",
"**Two examples** are shown below:\n",
"- In **Example 1**, KMedoidsSummarizer for the `'dataset_10k_10q_sample.csv'` data (created by the data loader above and randomly sampled to 2 rows) runs against the `'mdna'` column with only one instance.\n",
"- In **Example 2**, KMedoidsSummarizer for the `'dataset_8k_sample.csv'` data (created by the data loader above and randomly sampled to 2 rows) runs against the `'text'` column with two instances.\n",
"\n",
"For the same reasons as stated for the Jaccard summarizer, the k-medoids summarizer is also an extractive one. "
]
},
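{
"cell_type": "markdown",
"id": "5e1a0c72",
"metadata": {},
"source": [
"To make the notion of a medoid concrete, here is a minimal sketch that assumes each sentence in a cluster has already been converted to a numeric vector (how the library vectorizes sentences is not shown in this notebook). The helper names `euclidean` and `medoid_index` are hypothetical, and only the `'euclidean'` metric is illustrated.\n",
"\n",
"```python\n",
"def euclidean(a, b):\n",
"    # Straight-line distance between two equal-length vectors.\n",
"    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5\n",
"\n",
"\n",
"def medoid_index(vectors):\n",
"    # The medoid is the cluster member with the smallest total distance\n",
"    # to all the other members, so it is always an actual data point.\n",
"    totals = [sum(euclidean(v, other) for other in vectors) for v in vectors]\n",
"    return min(range(len(vectors)), key=lambda i: totals[i])\n",
"```\n",
"\n",
"Because the medoid is always one of the original points, the sentence it corresponds to can be quoted verbatim, which is what makes this approach extractive."
]
},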
{
"cell_type": "markdown",
"id": "2a6b813f",
"metadata": {},
"source": [
"#### Example 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36044223",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"kmedoids_summarizer_config = KMedoidsSummarizerConfig(summary_size=100)\n",
"\n",
"kmedoids_summarizer = Summarizer(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
")\n",
"\n",
"kmedoids_summarizer.summarize(\n",
" kmedoids_summarizer_config,\n",
" \"mdna\", # mdna column name\n",
" \"./dataset_10k_10q_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"KMedoids_summaries.csv\", # output file name\n",
" new_summary_column_name=\"summary\", # add column \"summary\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "94bf2559",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"KMedoids_summaries.csv\"),\n",
" \"KMedoids_summaries.csv\",\n",
")\n",
"KMedoids_summaries = pd.read_csv(\"KMedoids_summaries.csv\")\n",
"KMedoids_summaries.head()"
]
},
{
"cell_type": "markdown",
"id": "9d2dbb5d",
"metadata": {},
"source": [
"#### Example 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbd91e9b",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"kmedoids_summarizer_config = KMedoidsSummarizerConfig(summary_size=100)\n",
"\n",
"kmedoids_summarizer = Summarizer(\n",
" role=role, # loading job execution role\n",
" instance_count=2, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.2xlarge\", # instance type\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
")\n",
"\n",
"kmedoids_summarizer.summarize(\n",
" kmedoids_summarizer_config,\n",
" \"text\", # text column name\n",
" \"./dataset_8k_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"KMedoids_summaries_multi_instance.csv\", # output file name\n",
" new_summary_column_name=\"summary\", # add column \"summary\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88224afb",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"KMedoids_summaries_multi_instance.csv\"),\n",
" \"KMedoids_summaries_multi_instance.csv\",\n",
")\n",
"KMedoids_summaries_multi_instance = pd.read_csv(\"KMedoids_summaries_multi_instance.csv\")\n",
"KMedoids_summaries_multi_instance.head()"
]
},
{
"cell_type": "markdown",
"id": "c8c645b7",
"metadata": {},
"source": [
"## SEC Filing NLP Scoring \n",
"\n",
"The `smjsindustry` library provides 11 NLP score types by default: `positive`, `negative`, `litigious`, `polarity`, `risk`, `readability`, `fraud`, `safe`, `certainty`, `uncertainty`, and `sentiment`. Each score (except `readability` and `sentiment`) has its own word list, which is used for scanning and matching with an input text dataset.\n",
"\n",
"- The `readability` score type is calculated adopting the [Gunning fog index](https://en.wikipedia.org/wiki/Gunning_fog_index). \n",
"- The `sentiment` score type adopts [VADER sentiment analysis method](https://pypi.org/project/vaderSentiment/).\n",
"- The `polarity` score type uses the `positive` and `negative` word lists. \n",
"- The rest of the NLP score types (`positive`, `negative`, `litigious`, `risk`, `fraud`, `safe`, `certainty`, and `uncertainty`) evaluate the similarity (word frequency) with their corresponding word lists. For example, the `positive` NLP score has its own word list that contains words with \"positive\" meanings. To measure the `positive` score, the NLP scorer counts every occurrence of a word that is in the `positive` word list and calculates the proportion of these occurrences out of all the words in the text. Before matching, the words are stemmed to match different tenses of the same word. You can provide your own word list to calculate the predefined NLP scores or create your own score with a new word list.\n",
"\n",
"The NLP score types do not use human-curated word lists such as the dictionary from [Loughran and McDonald](https://sraf.nd.edu/textual-analysis/resources/), which is widely used in academia. Instead, the word lists are generated from word embeddings trained on standard large text corpora; each word list comprises words that are close to the concept word (such as `positive`, `negative`, and `risk` in this case) in an embedding space. These word lists may contain words that a human might not list out, but that might still occur in the context of the concept word.\n",
"\n",
"These NLP scores are added as new numerical columns to the text dataframe; this creates a multimodal dataframe, which is a mixture of tabular data and longform text, called **TabText**. When submitting this multimodal dataframe for ML, it is a good idea to normalize the columns of NLP scores (usually with standard normalization or min-max scaling).\n",
"\n",
"**Technical notes**:\n",
"\n",
"1. The NLPScorer sends SageMaker processing job requests to processing containers. It might take a few minutes to spin up a processing container. The actual NLP scoring starts after the initial spin-up. \n",
"2. You are not charged for the waiting time used for the initial spin-up. \n",
"3. You can run processing jobs in multiple instances.\n",
"4. The name of the processing job is shown in the runtime log.\n",
"5. You can also access the processing job from the [SageMaker console](https://console.aws.amazon.com/sagemaker). On the left navigation pane, choose Processing, Processing job.\n",
"6. NLP scoring can be slow for massive documents such as SEC filings, which contain anywhere from 20K to 100K words. Matching to word lists (usually ~200 words or more) can be time-consuming. This is why the rows of the dataframe are automatically distributed over multiple EC2 instances for this task. The examples below use a single instance (`instance_count=1`); when you specify more instances, the run logs show the different instances in different colors. You do not need to code up the distributed processing task; it is done automatically when the number of instances is specified. \n",
"7. VPC mode is supported in this API."
]
},
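{
"cell_type": "markdown",
"id": "5e1a0c73",
"metadata": {},
"source": [
"The word-list scores described above come down to a proportion of matching words. The following is a simplified sketch of that idea, not the `NLPScorer` implementation: the function name `word_list_score` is hypothetical, the tokenizer keeps lowercase alphabetic words only, and stemming is left out for brevity.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"\n",
"def word_list_score(text, word_list):\n",
"    # Proportion of the words in the text that appear in the word list.\n",
"    # Every occurrence counts; stemming is omitted in this sketch.\n",
"    tokens = re.findall('[a-z]+', text.lower())\n",
"    if not tokens:\n",
"        return 0.0\n",
"    vocabulary = {word.lower() for word in word_list}\n",
"    matches = sum(1 for token in tokens if token in vocabulary)\n",
"    return matches / len(tokens)\n",
"```\n",
"\n",
"For example, `word_list_score('Good results and great growth', ['good', 'great'])` evaluates to 0.4, because two of the five words are in the list."
]
},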
{
"cell_type": "markdown",
"id": "3621c5c5",
"metadata": {},
"source": [
"**Three examples** are shown below:\n",
"- In **Example 1**, 11 types of NLP scores for the `'dataset_10k_10q_sample.csv'` data (created by the `data_loader` and randomly sampled to 2 rows) are generated against the `'mdna'` column.\n",
"- In **Example 2**, customized positive and negative word lists are provided to calculate the positive and negative NLP scores for the `'dataset_10k_10q_sample.csv'` data (created by the `data_loader` and randomly sampled 2 rows) against the `'mdna'` column. \n",
"- In **Example 3**, a customized score type, in this case `'societal'`, is created using a `'societal'` word list. `'dataset_10k_10q_sample.csv'` data is loaded from a local file path."
]
},
{
"cell_type": "markdown",
"id": "061fd12a",
"metadata": {},
"source": [
"The processing job runs on `ml.c5.18xlarge` to reduce the running time. If `ml.c5.18xlarge` is not available in your AWS Region, change to a different CPU-based instance. If you encounter error messages that you've exceeded your quota, contact AWS Support to request a service limit increase for [SageMaker resources](https://console.aws.amazon.com/support/home#/) you want to scale up."
]
},
{
"cell_type": "markdown",
"id": "b33c515a",
"metadata": {},
"source": [
"#### Example 1\n",
"\n",
"It takes about 1 hour to run the following processing job because it computes all 11 types of NLP scores."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55d0f8ab",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"import smjsindustry\n",
"from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST\n",
"from smjsindustry import NLPScorer\n",
"from smjsindustry import NLPScorerConfig\n",
"\n",
"score_type_list = list(\n",
" NLPScoreType(score_type, [])\n",
" for score_type in NLPScoreType.DEFAULT_SCORE_TYPES\n",
" if score_type not in NLPSCORE_NO_WORD_LIST\n",
")\n",
"score_type_list.extend([NLPScoreType(score_type, None) for score_type in NLPSCORE_NO_WORD_LIST])\n",
"\n",
"nlp_scorer_config = NLPScorerConfig(score_type_list)\n",
"\n",
"nlp_score_processor = NLPScorer(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.18xlarge\", # ec2 instance type to run the loading job\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
") # a list of key-value pairs\n",
"\n",
"nlp_score_processor.calculate(\n",
" nlp_scorer_config,\n",
" \"mdna\", # input column\n",
" \"./dataset_10k_10q_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"all_scores.csv\", # output file name\n",
")"
]
},
{
"cell_type": "markdown",
"id": "6c27ef37",
"metadata": {},
"source": [
"The multimodal dataframe after the NLP scoring has completed is shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc9e235d",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket, \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"all_scores.csv\"), \"all_scores.csv\"\n",
")\n",
"all_scores = pd.read_csv(\"all_scores.csv\")\n",
"all_scores"
]
},
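{
"cell_type": "markdown",
"id": "5e1a0c74",
"metadata": {},
"source": [
"Before the NLP score columns are used as ML features, they are usually rescaled. Here is a minimal sketch of the two options mentioned earlier, min-max scaling and standard normalization, written for a plain list of values such as one score column. The function names are hypothetical, the standard deviation is the population one (an assumption of this sketch), and a constant column is mapped to zeros to avoid dividing by zero.\n",
"\n",
"```python\n",
"def min_max_scale(values):\n",
"    # Map the values linearly onto the interval [0, 1].\n",
"    low, high = min(values), max(values)\n",
"    if high == low:\n",
"        return [0.0 for _ in values]\n",
"    return [(v - low) / (high - low) for v in values]\n",
"\n",
"\n",
"def standardize(values):\n",
"    # Subtract the mean and divide by the population standard deviation.\n",
"    mean = sum(values) / len(values)\n",
"    variance = sum((v - mean) ** 2 for v in values) / len(values)\n",
"    if variance == 0:\n",
"        return [0.0 for _ in values]\n",
"    std = variance ** 0.5\n",
"    return [(v - mean) / std for v in values]\n",
"```\n",
"\n",
"With a dataframe, the same functions can be applied to one score column at a time, for instance by passing `column.tolist()`."
]
},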
{
"cell_type": "markdown",
"id": "fa7355d2",
"metadata": {},
"source": [
"#### Example 2\n",
"\n",
"The following example shows how to set custom word lists for the `POSITIVE` and `NEGATIVE` score types. The processing job computes only these two score types. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "461ce6ba",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"import smjsindustry\n",
"from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST\n",
"from smjsindustry import NLPScorer\n",
"from smjsindustry import NLPScorerConfig\n",
"\n",
"\n",
"custom_positive_word_list = [\n",
" \"good\",\n",
" \"great\",\n",
" \"nice\",\n",
" \"accomplish\",\n",
" \"accept\",\n",
" \"agree\",\n",
" \"believe\",\n",
" \"genius\",\n",
" \"impressive\",\n",
"]\n",
"custom_negative_word_list = [\n",
" \"bad\",\n",
" \"broken\",\n",
" \"deny\",\n",
" \"damage\",\n",
" \"disease\",\n",
" \"guilty\",\n",
" \"injure\",\n",
" \"negate\",\n",
" \"pain\",\n",
" \"reject\",\n",
"]\n",
"\n",
"score_type_pos = NLPScoreType(NLPScoreType.POSITIVE, custom_positive_word_list)\n",
"score_type_neg = NLPScoreType(NLPScoreType.NEGATIVE, custom_negative_word_list)\n",
"\n",
"score_type_list = [score_type_pos, score_type_neg]\n",
"\n",
"nlp_scorer_config = NLPScorerConfig(score_type_list)\n",
"\n",
"nlp_score_processor = NLPScorer(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.18xlarge\", # ec2 instance type to run the loading job\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key for the processing volume\n",
" output_kms_key=None, # KMS key ID for processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours\n",
" sagemaker_session=sagemaker.Session(), # session object\n",
" tags=None,\n",
") # a list of key-value pairs\n",
"\n",
"nlp_score_processor.calculate(\n",
" nlp_scorer_config,\n",
" \"mdna\", # input column\n",
" \"./dataset_10k_10q_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"scores_custom_word_list.csv\", # output file name\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56a31287",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"scores_custom_word_list.csv\"),\n",
" \"scores_custom_word_list.csv\",\n",
")\n",
"scores = pd.read_csv(\"scores_custom_word_list.csv\")\n",
"scores"
]
},
{
"cell_type": "markdown",
"id": "b6b4be89",
"metadata": {},
"source": [
"#### Example 3\n",
"\n",
"The following example shows how to create a custom score type, `societal`, from a custom word list. It might take about 30 minutes to run the processing job."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "374775f4",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"import smjsindustry\n",
"from smjsindustry import NLPScoreType, NLPSCORE_NO_WORD_LIST\n",
"from smjsindustry import NLPScorer\n",
"from smjsindustry import NLPScorerConfig\n",
"\n",
"societal = pd.read_csv(\"societal_words.csv\", header=None)\n",
"societal_word_list = societal[0].tolist()\n",
"score_type_societal = NLPScoreType(\"societal\", societal_word_list)\n",
"\n",
"score_type_list = [score_type_societal]\n",
"\n",
"nlp_scorer_config = NLPScorerConfig(score_type_list)\n",
"\n",
"nlp_score_processor = NLPScorer(\n",
" role=role, # loading job execution role\n",
" instance_count=1, # instances number, limit varies with instance type\n",
" instance_type=\"ml.c5.18xlarge\", # ec2 instance type to run the loading job\n",
" volume_size_in_gb=30, # size in GB of the EBS volume to use\n",
" volume_kms_key=None, # KMS key ID to encrypt the processing volume\n",
" output_kms_key=None, # KMS key ID to encrypt processing job outputs\n",
" max_runtime_in_seconds=None, # timeout in seconds. Default is 24 hours\n",
" sagemaker_session=session, # session object\n",
" tags=None,\n",
") # a list of key-value pairs\n",
"\n",
"nlp_score_processor.calculate(\n",
" nlp_scorer_config,\n",
" \"mdna\", # input column\n",
" \"./dataset_10k_10q_sample.csv\", # input file path\n",
" \"s3://{}/{}/{}\".format(\n",
" bucket, sec_processed_folder, \"output\"\n",
" ), # output s3 prefix (both bucket and folder names are required)\n",
" \"scores_custom_score.csv\", # output file name\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2c14bda",
"metadata": {},
"outputs": [],
"source": [
"client = boto3.client(\"s3\")\n",
"client.download_file(\n",
" bucket,\n",
" \"{}/{}/{}\".format(sec_processed_folder, \"output\", \"scores_custom_score.csv\"),\n",
" \"scores_custom_score.csv\",\n",
")\n",
"custom_scores = pd.read_csv(\"scores_custom_score.csv\")\n",
"custom_scores"
]
},
{
"cell_type": "markdown",
"id": "35ca96cb",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"This notebook showed how to:\n",
"\n",
"1. Retrieve parsed plain text of various SEC filings in one API call, stored as a CSV file and represented in dataframes. \n",
"\n",
"2. Add columns to the dataframe for different summaries. \n",
"\n",
"3. Score the text column using the `NLPScorer` processor for text attributes, such as positivity, negativity, and litigiousness, using the default word list or custom word lists. "
]
},
{
"cell_type": "markdown",
"id": "30301526",
"metadata": {},
"source": [
"## Clean Up\n",
"\n",
"After you are done using this notebook, delete the model artifacts and other resources to avoid incurring any charges.\n",
"\n",
">**Caution:** You need to manually delete resources that you may have created while running the notebook, such as Amazon S3 buckets for model artifacts, training datasets, processing artifacts, and Amazon CloudWatch log groups.\n",
"\n",
"For more information about cleaning up resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*."
]
},
{
"cell_type": "markdown",
"id": "edb71c91",
"metadata": {},
"source": [
"## License\n",
"\n",
"The SageMaker JumpStart Industry product and its related materials are under the [Legal License Terms](https://jumpstart-cache-prod-us-west-2.s3.us-west-2.amazonaws.com/smfinance-notebook-dependency/legal_file.txt)."
]
},
{
"cell_type": "markdown",
"id": "75c12c34",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook."
]
}
],
"metadata": {
"availableInstances": [
{
"_defaultOrder": 0,
"_isFastLaunch": true,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 4,
"name": "ml.t3.medium",
"vcpuNum": 2
},
{
"_defaultOrder": 1,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 8,
"name": "ml.t3.large",
"vcpuNum": 2
},
{
"_defaultOrder": 2,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.t3.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 3,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.t3.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 4,
"_isFastLaunch": true,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 8,
"name": "ml.m5.large",
"vcpuNum": 2
},
{
"_defaultOrder": 5,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.m5.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 6,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.m5.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 7,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 64,
"name": "ml.m5.4xlarge",
"vcpuNum": 16
},
{
"_defaultOrder": 8,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 128,
"name": "ml.m5.8xlarge",
"vcpuNum": 32
},
{
"_defaultOrder": 9,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 192,
"name": "ml.m5.12xlarge",
"vcpuNum": 48
},
{
"_defaultOrder": 10,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 256,
"name": "ml.m5.16xlarge",
"vcpuNum": 64
},
{
"_defaultOrder": 11,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 384,
"name": "ml.m5.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 12,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 8,
"name": "ml.m5d.large",
"vcpuNum": 2
},
{
"_defaultOrder": 13,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.m5d.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 14,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.m5d.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 15,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 64,
"name": "ml.m5d.4xlarge",
"vcpuNum": 16
},
{
"_defaultOrder": 16,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 128,
"name": "ml.m5d.8xlarge",
"vcpuNum": 32
},
{
"_defaultOrder": 17,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 192,
"name": "ml.m5d.12xlarge",
"vcpuNum": 48
},
{
"_defaultOrder": 18,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 256,
"name": "ml.m5d.16xlarge",
"vcpuNum": 64
},
{
"_defaultOrder": 19,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 384,
"name": "ml.m5d.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 20,
"_isFastLaunch": false,
"category": "General purpose",
"gpuNum": 0,
"hideHardwareSpecs": true,
"memoryGiB": 0,
"name": "ml.geospatial.interactive",
"supportedImageNames": [
"sagemaker-geospatial-v1-0"
],
"vcpuNum": 0
},
{
"_defaultOrder": 21,
"_isFastLaunch": true,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 4,
"name": "ml.c5.large",
"vcpuNum": 2
},
{
"_defaultOrder": 22,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 8,
"name": "ml.c5.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 23,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.c5.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 24,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.c5.4xlarge",
"vcpuNum": 16
},
{
"_defaultOrder": 25,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 72,
"name": "ml.c5.9xlarge",
"vcpuNum": 36
},
{
"_defaultOrder": 26,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 96,
"name": "ml.c5.12xlarge",
"vcpuNum": 48
},
{
"_defaultOrder": 27,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 144,
"name": "ml.c5.18xlarge",
"vcpuNum": 72
},
{
"_defaultOrder": 28,
"_isFastLaunch": false,
"category": "Compute optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 192,
"name": "ml.c5.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 29,
"_isFastLaunch": true,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.g4dn.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 30,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.g4dn.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 31,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 64,
"name": "ml.g4dn.4xlarge",
"vcpuNum": 16
},
{
"_defaultOrder": 32,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 128,
"name": "ml.g4dn.8xlarge",
"vcpuNum": 32
},
{
"_defaultOrder": 33,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 4,
"hideHardwareSpecs": false,
"memoryGiB": 192,
"name": "ml.g4dn.12xlarge",
"vcpuNum": 48
},
{
"_defaultOrder": 34,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 256,
"name": "ml.g4dn.16xlarge",
"vcpuNum": 64
},
{
"_defaultOrder": 35,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 61,
"name": "ml.p3.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 36,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 4,
"hideHardwareSpecs": false,
"memoryGiB": 244,
"name": "ml.p3.8xlarge",
"vcpuNum": 32
},
{
"_defaultOrder": 37,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 8,
"hideHardwareSpecs": false,
"memoryGiB": 488,
"name": "ml.p3.16xlarge",
"vcpuNum": 64
},
{
"_defaultOrder": 38,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 8,
"hideHardwareSpecs": false,
"memoryGiB": 768,
"name": "ml.p3dn.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 39,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.r5.large",
"vcpuNum": 2
},
{
"_defaultOrder": 40,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.r5.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 41,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 64,
"name": "ml.r5.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 42,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 128,
"name": "ml.r5.4xlarge",
"vcpuNum": 16
},
{
"_defaultOrder": 43,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 256,
"name": "ml.r5.8xlarge",
"vcpuNum": 32
},
{
"_defaultOrder": 44,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 384,
"name": "ml.r5.12xlarge",
"vcpuNum": 48
},
{
"_defaultOrder": 45,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 512,
"name": "ml.r5.16xlarge",
"vcpuNum": 64
},
{
"_defaultOrder": 46,
"_isFastLaunch": false,
"category": "Memory Optimized",
"gpuNum": 0,
"hideHardwareSpecs": false,
"memoryGiB": 768,
"name": "ml.r5.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 47,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 16,
"name": "ml.g5.xlarge",
"vcpuNum": 4
},
{
"_defaultOrder": 48,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 32,
"name": "ml.g5.2xlarge",
"vcpuNum": 8
},
{
"_defaultOrder": 49,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 64,
"name": "ml.g5.4xlarge",
"vcpuNum": 16
},
{
"_defaultOrder": 50,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 128,
"name": "ml.g5.8xlarge",
"vcpuNum": 32
},
{
"_defaultOrder": 51,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 1,
"hideHardwareSpecs": false,
"memoryGiB": 256,
"name": "ml.g5.16xlarge",
"vcpuNum": 64
},
{
"_defaultOrder": 52,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 4,
"hideHardwareSpecs": false,
"memoryGiB": 192,
"name": "ml.g5.12xlarge",
"vcpuNum": 48
},
{
"_defaultOrder": 53,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 4,
"hideHardwareSpecs": false,
"memoryGiB": 384,
"name": "ml.g5.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 54,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 8,
"hideHardwareSpecs": false,
"memoryGiB": 768,
"name": "ml.g5.48xlarge",
"vcpuNum": 192
},
{
"_defaultOrder": 55,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 8,
"hideHardwareSpecs": false,
"memoryGiB": 1152,
"name": "ml.p4d.24xlarge",
"vcpuNum": 96
},
{
"_defaultOrder": 56,
"_isFastLaunch": false,
"category": "Accelerated computing",
"gpuNum": 8,
"hideHardwareSpecs": false,
"memoryGiB": 1152,
"name": "ml.p4de.24xlarge",
"vcpuNum": 96
}
],
"kernelspec": {
"display_name": "Python 3 (Data Science 3.0)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-310-v1"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}