# SEC405: Scalable, Automated Anomaly Detection with Amazon GuardDuty and SageMaker

## Using IP Insights to score security findings
-------
[Return to the workshop repository](https://github.com/aws-samples/aws-security-workshops/edit/master/detection-ml-wksp/)

Amazon SageMaker IP Insights is an unsupervised anomaly detection algorithm for suspicious IP addresses that uses statistical modeling and neural networks to capture associations between online resources (such as account IDs or hostnames) and IPv4 addresses. Under the hood, it learns vector representations for online resources and IP addresses. This essentially means that if the vectors representing an IP address and an online resource are close together, then it is likely for that IP address to access that online resource, even if it has never accessed it before.
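
To make the idea of "closeness" concrete, here is a toy sketch. It assumes hand-written three-dimensional vectors and uses a plain dot product as the compatibility score, mirroring the `dot_product` score that the model returns at inference time. The vectors, names, and dimensions are illustrative and do not come from a trained model.

```python
# Toy illustration only: hand-written vectors, not the output of a trained model.

def dot(u, v):
    """Return the dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical learned representation for one principal (online resource).
principal_vec = [0.9, 0.1, -0.3]

# Hypothetical learned representations for two IPv4 addresses.
ip_vectors = {
    "192.0.2.10": [0.8, 0.2, -0.2],    # points in a similar direction
    "198.51.100.7": [-0.7, 0.1, 0.6],  # points in a different direction
}

# A higher score means the pairing looks more "expected".
scores = {ip: dot(principal_vec, vec) for ip, vec in ip_vectors.items()}
for ip, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("{}: {:.2f}".format(ip, score))
```

The address whose vector points in the same direction as the principal's vector receives the higher score, which is the intuition behind treating low scores as unusual.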

In this notebook, we use the Amazon SageMaker IP Insights algorithm to train a model using the `<principal ID, IP address>` tuples we generated from the CloudTrail log data. We then use the model to perform inference on the same type of tuples generated from GuardDuty findings, to determine how unusual it is to see a particular IP address for a given principal involved with a finding.
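
The tuples are prepared earlier in the workshop, and that step is not shown in this notebook. As a rough sketch of what such a transformation could look like, the example below pulls `userIdentity.principalId` and `sourceIPAddress` out of CloudTrail-style records and writes them as CSV. The choice of `principalId` as the principal, the helper names, and the sample records are assumptions for illustration.

```python
import csv
import io
import ipaddress


def is_ipv4(value):
    """Return True if `value` parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


def cloudtrail_to_tuples(records):
    """Yield (principal ID, source IP) pairs from CloudTrail-style record dicts."""
    for record in records:
        principal = record.get("userIdentity", {}).get("principalId")
        ip = record.get("sourceIPAddress", "")
        # Skip records with no principal, or whose source is not an IPv4
        # address (for example, a call made by an AWS service).
        if principal and is_ipv4(ip):
            yield principal, ip


# Made-up records that follow the CloudTrail record layout.
sample_records = [
    {"userIdentity": {"principalId": "AIDAEXAMPLE1"}, "sourceIPAddress": "192.0.2.10"},
    {"userIdentity": {"principalId": "AIDAEXAMPLE2"}, "sourceIPAddress": "cloudformation.amazonaws.com"},
    {"userIdentity": {}, "sourceIPAddress": "198.51.100.7"},
]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(cloudtrail_to_tuples(sample_records))
tuples_csv = buffer.getvalue()
print(tuples_csv, end="")
```

Only the first sample record survives the filter, because the second has a service name instead of an IP address and the third has no principal.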

After running this notebook, you should be able to:

- obtain, transform, and store data for use in Amazon SageMaker,
- create an Amazon SageMaker training job to produce an IP Insights model,
- use the model to perform inference with an Amazon SageMaker endpoint.

If you would like to know more, please check out the [SageMaker IP Insights Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html).

## Setup
------
*This notebook was created and tested on an ml.m4.xlarge notebook instance. We recommend using the same, but other instance types should still work.*

The following cell contains Python code and can be executed by clicking the button above labelled "Run". While a cell is running, you will see a star in the brackets to the left (e.g., `In [*]`), and when it has completed you will see a number in the brackets. Each click of "Run" will execute the next cell in the notebook.

Go ahead and click **Run** now. You should see the text in the `print` statement get printed just beneath the cell.

All of these cells share the same interpreter, so if a cell imports modules, like this one does, those modules will be available to every subsequent cell.

In [None]:
import boto3
import botocore
import os
import sagemaker

print("Welcome to IP Insights!")

### Configure Amazon S3 Bucket

Before going further, we need to specify the S3 bucket that SageMaker will use for input and output data for the model. This is the bucket where our training tuples (from CloudTrail logs) and inference tuples (from GuardDuty findings) are located. Edit the following cell to specify the name of the bucket and then run it; you do not need to change the prefix.

In [None]:
# Specify the full name of your "sec405-tuplesbucket" here
bucket = 'sec405-tuplesbucket-########'
prefix = ''

Finally, run the next cell to complete the setup.

In [None]:
execution_role = sagemaker.get_execution_role()

# Check if the bucket exists
try:
    boto3.Session().client('s3').head_bucket(Bucket=bucket)
except botocore.exceptions.ParamValidationError as e:
    print('Hey! You either forgot to specify your S3 bucket'
          ' or you gave your bucket an invalid name!')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == '403':
        print("Hey! You don't have permission to access the bucket, {}.".format(bucket))
    elif e.response['Error']['Code'] == '404':
        print("Hey! Your bucket, {}, doesn't exist!".format(bucket))
    else:
        raise
else:
    print('Training input/output will be stored in: s3://{}/{}'.format(bucket, prefix))

## Training

Execute the two cells below to start training. Training should take several minutes to complete. You can look at various training metrics in the log as the model trains. These logs are also available in CloudWatch.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

image = get_image_uri(boto3.Session().region_name, 'ipinsights')


# Configure SageMaker IP Insights input channels
train_key = os.path.join(prefix, 'train', 'cloudtrail_tuples.csv')
s3_train_data = 's3://{}/{}'.format(bucket, train_key)

input_data = {
 'train': sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='text/csv')
}

In [None]:
# Set up the estimator with training job configuration
ip_insights = sagemaker.estimator.Estimator(
    image,
    execution_role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    # Join prefix and 'output' so an empty prefix does not produce 's3://<bucket>//output'
    output_path='s3://{}/{}'.format(bucket, os.path.join(prefix, 'output')),
    sagemaker_session=sagemaker.Session())

# Configure algorithm-specific hyperparameters
ip_insights.set_hyperparameters(
 num_entity_vectors='20000',
 random_negative_sampling_rate='5',
 vector_dim='128', 
 mini_batch_size='1000',
 epochs='5',
 learning_rate='0.01',
)

# Start the training job (should take 3-4 minutes to complete) 
ip_insights.fit(input_data)

In [None]:
print('Training job name: {}'.format(ip_insights.latest_training_job.job_name))
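
The training cell fixes `num_entity_vectors` at 20000. The IP Insights documentation suggests sizing this hyperparameter at roughly twice the number of unique entity identifiers in the training data, to limit hash collisions between entities. A minimal sketch of that check follows, assuming the training CSV has the principal in the first column and the IP address in the second. `suggest_num_entity_vectors` is a hypothetical helper, not part of the SageMaker SDK, and the sample rows are made up.

```python
import csv
import io


def suggest_num_entity_vectors(csv_text, multiplier=2):
    """Return (unique entity count, suggested num_entity_vectors).

    Entities are read from the first CSV column; blank lines are ignored.
    """
    reader = csv.reader(io.StringIO(csv_text))
    entities = {row[0] for row in reader if row}
    return len(entities), multiplier * len(entities)


# Made-up training rows: two distinct principals across three events.
sample_csv = (
    "AIDAEXAMPLE1,192.0.2.10\n"
    "AIDAEXAMPLE2,192.0.2.11\n"
    "AIDAEXAMPLE1,198.51.100.7\n"
)

unique_count, suggested = suggest_num_entity_vectors(sample_csv)
print("unique principals: {}, suggested num_entity_vectors: {}".format(unique_count, suggested))
```

If the number of unique principals in your CloudTrail tuples is far below 10000, the value used above is already generous; if it is far above, a larger value may be worth trying.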


## Deployment
Execute the cell below to deploy the trained model on an endpoint for inference. It should take 5-7 minutes to spin up the instance and deploy the model (the horizontal dashed line represents progress, and it will print an exclamation point \[!\] when it is complete).

In [None]:
predictor = ip_insights.deploy(
 initial_instance_count=1,
 instance_type='ml.m4.xlarge'
)

In [None]:
print('Endpoint name: {}'.format(predictor.endpoint))

## Inference
We can pass data in a variety of formats to our inference endpoint. In this example, we will pass CSV-formatted data.

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.accept = 'application/json'
predictor.deserializer = json_deserializer

When queried with a principal and an IP address, the model returns a score (called `dot_product`) that indicates how expected that event is: the higher the `dot_product`, the more normal the event. First, let's run inference on the training (normal) data as a sanity check.

In [None]:
import pandas as pd

# Run inference on training (normal) data for sanity check
s3_infer_data = 's3://{}/{}'.format(bucket, train_key)
inference_data = pd.read_csv(s3_infer_data, header=None)
print(inference_data.head())  # preview the first rows; a bare expression mid-cell is not displayed
predictor.predict(inference_data.values)

Now let's run inference on the GuardDuty findings. Notice that the `dot_product` scores are much lower than the scores for the training (normal) data.

In [None]:
# Run inference on GuardDuty findings
infer_key = os.path.join(prefix, 'infer', 'guardduty_tuples.csv')
s3_infer_data = 's3://{}/{}'.format(bucket, infer_key)
inference_data = pd.read_csv(s3_infer_data, header=None)
print(inference_data.head())  # preview the first rows; a bare expression mid-cell is not displayed
predictor.predict(inference_data.values)
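
To act on the scores, one option is to pair each tuple with its `dot_product` and flag those that fall below a cut-off. The sketch below assumes the endpoint's JSON response has the shape `{"predictions": [{"dot_product": ...}, ...]}` and uses a made-up threshold of `0.0`; in practice you would choose a threshold by looking at the score distribution on the training data. `flag_unusual` and the sample values are hypothetical.

```python
def flag_unusual(tuples, response, threshold=0.0):
    """Return (principal, ip, score) for every tuple scoring below `threshold`.

    `tuples` is a sequence of (principal, ip) pairs in the same order as the
    rows sent to the endpoint; `response` is the deserialized JSON reply.
    """
    scores = [p["dot_product"] for p in response["predictions"]]
    if len(scores) != len(tuples):
        raise ValueError("expected one score per input tuple")
    return [
        (principal, ip, score)
        for (principal, ip), score in zip(tuples, scores)
        if score < threshold
    ]


# Made-up inputs and a made-up response in the assumed shape.
sample_tuples = [("AIDAEXAMPLE1", "192.0.2.10"), ("AIDAEXAMPLE1", "198.51.100.7")]
sample_response = {"predictions": [{"dot_product": 3.2}, {"dot_product": -1.5}]}

flagged = flag_unusual(sample_tuples, sample_response)
for principal, ip, score in flagged:
    print("unusual: {} from {} (dot_product={})".format(principal, ip, score))
```

With the variables from the cell above, the call would look something like `flag_unusual(inference_data.values.tolist(), predictor.predict(inference_data.values))`.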

## Clean-up

To clean up resources created during the workshop, please see the [Cleaning up](https://github.com/aws-samples/aws-security-workshops/blob/cff322dab7cc0b9d71c4f1575c7016389b9dbe64/detection-ml-wksp/README.md) section in the workshop README guide.