# Detect Social Media Fake News using Amazon Neptune ML

In this notebook we demonstrate how to use Graph Machine Learning from Amazon Neptune ML to identify fake news on social media. We created a graph dataset based on the [BuzzFeed data](https://github.com/KaiDMML/FakeNewsNet/tree/old-version/Data/BuzzFeed) from the 2018 version of FakeNewsNet in the `create-graph-dataset.ipynb` notebook. 

Note: Use [these CloudFormation templates](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-quick-start.html) to quickly spin up a `graph-notebook` and an associated Neptune cluster, and to set up all the configuration needed to work with Neptune ML in a `graph-notebook`. You can use the `%graph_notebook_config` magic command to see information about the Neptune cluster associated with your graph-notebook, and the `%status` magic command to see the status of your Neptune cluster.

## Setup

In [None]:
# import required libraries
import boto3
import sagemaker
import pandas as pd
import utils.neptune_ml_utils as neptune_ml
# Check to make sure your Neptune cluster is configured to run Neptune ML.
neptune_ml.check_ml_enabled()

In [None]:
sess = sagemaker.Session()
bucket = sess.default_bucket()

# S3 location that will be used to store data, processing results and model artifacts.
# To use a bucket other than the session default, assign its name to `bucket` here.
prefix = 'fake-news-detection'
s3_uri = f"s3://{bucket}/{prefix}"

## Checking Neptune DB

Check the status of the Neptune cluster:

In [None]:
%status

To verify that the graph dataset is loaded in the Neptune cluster, we run the following Gremlin traversals to see the count of nodes and edges by label:

In [None]:
%%gremlin
g.V().groupCount().by(label).unfold().order().by(keys)

If the nodes are loaded correctly, the output should be:

* 126 `author` nodes
* 182 `news` nodes
* 28 `publisher` nodes
* 15,257 `user` nodes

In [None]:
%%gremlin
g.E().groupCount().by(label).unfold().order().by(keys)

If the edges are loaded correctly, the output should be:

* 634,750 `follows` edges
* 174 `published` edges
* 250 `wrote` edges
* 250 `wrote_for` edges
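
Comparing the two outputs against these lists by eye is error-prone, so one option is to diff them in Python. The sketch below is only an illustration: `diff_counts` is a hypothetical helper, and the observed values are typed in by hand (with a deliberate mismatch) rather than read from the cluster. Labels that appear on only one side are reported with a count of 0 on the other.

```python
# Expected label counts, copied from the lists above.
EXPECTED_NODES = {"author": 126, "news": 182, "publisher": 28, "user": 15257}
EXPECTED_EDGES = {"follows": 634750, "published": 174, "wrote": 250, "wrote_for": 250}


def diff_counts(observed, expected):
    """Return {label: (observed, expected)} for every label whose count differs."""
    mismatches = {}
    for label in sorted(set(observed) | set(expected)):
        got, want = observed.get(label, 0), expected.get(label, 0)
        if got != want:
            mismatches[label] = (got, want)
    return mismatches


# Hypothetical observed values: pretend two `news` nodes failed to load.
observed_nodes = {"author": 126, "news": 180, "publisher": 28, "user": 15257}
node_mismatches = diff_counts(observed_nodes, EXPECTED_NODES)
print(node_mismatches or "all node counts match")
```

An empty result means every label matched, so the same helper can be reused for the edge counts.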

## Preparing for Export
With our data validated, let's simulate new `news` being added to our graph by removing the `news_type` property (i.e. the target variable for machine learning) from two of the `news` nodes. We will treat these nodes as test nodes later: at the end of the notebook we run inference on them to determine whether they are `real` or `fake`.
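
The idea of holding out labels can be sketched on plain Python dictionaries. This is not how the notebook does it (the notebook drops the property directly in the graph with Gremlin); `hold_out` is a hypothetical helper, and the node ids, titles and labels are made up for illustration.

```python
import copy

# Toy stand-in for `news` nodes; ids, titles and labels are invented.
nodes = {
    "n1": {"news_title": "Example headline A", "news_type": "real"},
    "n2": {"news_title": "Example headline B", "news_type": "fake"},
    "n3": {"news_title": "Example headline C", "news_type": "real"},
}


def hold_out(nodes, test_ids, target="news_type"):
    """Remove `target` from the chosen nodes and return (masked_nodes, ground_truth)."""
    masked = copy.deepcopy(nodes)  # leave the caller's data untouched
    truth = {}
    for node_id in test_ids:
        truth[node_id] = masked[node_id].pop(target)
    return masked, truth


masked_nodes, ground_truth = hold_out(nodes, ["n2", "n3"])
print(ground_truth)
print(masked_nodes["n2"])
```

Keeping the removed labels in `ground_truth` makes it possible to score the model's predictions for the held-out nodes later.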

Let's begin by taking a look at the current value of the `news_type` property for those two `news` nodes.

In [None]:
%%gremlin

g.V().has('news', 'news_title', within("Jeb Bush to lecture at Harvard this fall", "BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28")).
 valueMap('news_title', 'news_type')

Now let's remove these `news_type` property values from the data.

In [None]:
%%gremlin

g.V().has('news', 'news_title', within("Jeb Bush to lecture at Harvard this fall", "BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28")).
 properties('news_type').drop()


Let's check those two `news` nodes again to verify that they no longer have `news_type` values.

In [None]:
%%gremlin

g.V().has('news', 'news_title', within("Jeb Bush to lecture at Harvard this fall", "BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28")).
 valueMap('news_title', 'news_type')

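Next, we visualize the neighborhood of one of the test nodes: the authors who wrote it, the publishers they write for, the users who spread it, and three hops of those users' followers.

In [None]: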
%%gremlin -g T.label -p v,inE,outV,outE,inV,outE,inV,oute,inv,ine,outv,ine,outv,ine,outv

g.V().has('news', 'news_title', within("BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28"))
.inE("wrote")
.outV()
.outE("wrote_for")
.inV()
.outE("published")
.inV().has('news', 'news_title', within("BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28"))
.outE("spread_by")
.inV()
.inE("follows")
.outV()
.inE("follows")
.outV()
.inE("follows")
.outV()
.path()
.by(valueMap(true))
.limit(100)

## Exporting Data and Model Configuration

The export process is triggered by calling the Neptune Export service endpoint. This call contains a configuration object which specifies the type of machine learning model to build, in our case `node classification`, as well as any feature configurations required.

The configuration options provided to the Neptune Export service are broken into two main sections: selecting the target and configuring features. Here we want to classify `news` nodes according to the `news_type` property.

The second section of the configuration, configuring features, is where we specify details about the types of data stored in our graph and how the machine learning model should interpret that data. When data is exported from Neptune, all properties of all nodes are included. Each property is treated as a separate feature for the ML model. Neptune ML does its best to infer the correct feature type for each property, but in many cases the accuracy of the model can be improved by explicitly specifying how a property should be treated. We use [word2vec](https://en.wikipedia.org/wiki/Word2vec) to encode the `news_title` property of `news` nodes, and the `numerical` type for the `user_features` property of `user` nodes.
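
To make the two sections concrete, the sketch below assembles the `neptune_ml` part of the configuration from simple (node, property, type) triples. `build_ml_config` is a hypothetical helper, not part of `neptune_ml_utils`; the keys and values mirror the ones used in this notebook's export cell.

```python
import json


def build_ml_config(targets, features, version="v2.0"):
    """Assemble the `neptune_ml` section from (node, property, type) triples."""

    def as_entry(triple):
        node, prop, kind = triple
        return {"node": node, "property": prop, "type": kind}

    return {
        "version": version,
        "targets": [as_entry(t) for t in targets],    # what to predict
        "features": [as_entry(f) for f in features],  # how to encode inputs
    }


ml_config = build_ml_config(
    targets=[("news", "news_type", "classification")],
    features=[
        ("news", "news_title", "text_word2vec"),
        ("user", "user_features", "numerical"),
    ],
)
print(json.dumps(ml_config, indent=2))
```

Properties that are not listed under `features` are still exported; for those, the feature type is left for Neptune ML to infer.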

In [None]:
export_params = {
    "command": "export-pg",
    "params": {
        "endpoint": neptune_ml.get_host(),
        "profile": "neptune_ml",
        "useIamAuth": neptune_ml.get_iam(),
        "cloneCluster": False
    },
    "outputS3Path": f"{s3_uri}/neptune-export",
    "additionalParams": {
        "neptune_ml": {
            "version": "v2.0",
            "targets": [
                {
                    "node": "news",
                    "property": "news_type",
                    "type": "classification"
                }
            ],
            "features": [
                {
                    "node": "news",
                    "property": "news_title",
                    "type": "text_word2vec"
                },
                {
                    "node": "user",
                    "property": "user_features",
                    "type": "numerical"
                }
            ]
        }
    },
    "jobSize": "medium"
}

In [None]:
%%neptune_ml export start --export-url {neptune_ml.get_export_service_host()} --export-iam --wait --store-to export_results
${export_params}

## ML Data Processing

Once the export job is completed we are ready to train our machine learning model and create the inference endpoint. There are three machine learning steps in Neptune ML. The first step (data processing) processes the exported graph dataset using standard feature preprocessing techniques to prepare it for use by [Deep Graph Library (DGL)](https://www.dgl.ai/). This step performs functions such as feature normalization for numeric data and encoding text features using word2vec. At the conclusion of this step the dataset is formatted for model training.
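
As an illustration of what feature normalization means for a numeric property, the sketch below applies min-max scaling to a made-up column. Min-max scaling is one common choice; the method Neptune ML actually applies depends on the feature configuration, so treat this as a picture of the idea rather than the service's implementation. `min_max_normalize` is a hypothetical helper.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [2.0, 4.0, 6.0, 10.0]  # made-up feature column
scaled = min_max_normalize(raw)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling features to a common range keeps properties with large raw values from dominating the others during training.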

This step is implemented using a SageMaker Processing Job and data artifacts are stored in a pre-specified S3 location once the job is completed. Running the cells below will create the data processing configuration and begin the processing job.

In [None]:
# The training_job_name can be set to a unique value below, otherwise one will be auto generated
training_job_name=neptune_ml.get_training_job_name('fake-news-detection')

processing_params = f"""
--config-file-name training-data-configuration.json
--job-id {training_job_name} 
--s3-input-uri {export_results['outputS3Uri']} 
--s3-processed-uri {str(s3_uri)}/preloading
--instance-type ml.c5.9xlarge 
"""

In [None]:
%neptune_ml dataprocessing start --wait --store-to processing_results {processing_params}

## Model Training
The second step (model training) trains the ML model that will be used for predictions. The model training is done in two stages. The first stage uses a SageMaker Processing job to generate a model training strategy. A model training strategy is a configuration set that specifies what type of model and model hyperparameter ranges will be used for the model training. Once the first stage is complete, the SageMaker Processing job launches a SageMaker Hyperparameter tuning job. The SageMaker Hyperparameter tuning job runs a pre-specified number of model training job trials on the processed data, and stores the model artifacts generated by the training in the output S3 location. Once all the training jobs are complete, the Hyperparameter tuning job also notes the training job that produced the best performing model.
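
The tuning loop can be pictured with a toy random search. This is a simplified stand-in, not SageMaker's tuning strategy: the search space, the objective function and the helper names are all hypothetical, and the only values taken from this notebook are the limits of 30 trials with 3 running in parallel, which mirror `--max-hpo-number` and `--max-hpo-parallel`.

```python
import random

# Hypothetical search space; the real ranges come from the generated training strategy.
SEARCH_SPACE = {"lr": (0.001, 0.1), "num_hidden": (16, 128)}


def sample_config(rng):
    """Draw one hyperparameter configuration uniformly from SEARCH_SPACE."""
    return {
        "lr": rng.uniform(*SEARCH_SPACE["lr"]),
        "num_hidden": rng.randint(*SEARCH_SPACE["num_hidden"]),
    }


def toy_objective(config):
    """Stand-in for validation accuracy; peaks at lr=0.01, num_hidden=64."""
    return 1.0 - abs(config["lr"] - 0.01) - abs(config["num_hidden"] - 64) / 128


def random_search(max_trials, max_parallel, seed=0):
    """Run trials in batches of `max_parallel`; return (best_trial, all_trials)."""
    rng = random.Random(seed)
    trials = []
    while len(trials) < max_trials:
        batch_size = min(max_parallel, max_trials - len(trials))
        batch = [sample_config(rng) for _ in range(batch_size)]
        trials.extend((toy_objective(c), c) for c in batch)
    best = max(trials, key=lambda t: t[0])  # highest objective wins
    return best, trials


best_trial, all_trials = random_search(max_trials=30, max_parallel=3)
print(len(all_trials), round(best_trial[0], 3))
```

Raising the trial limit explores more of the search space at the cost of more training jobs, while the parallel limit only changes how many run at once.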

In [None]:
training_params=f"""
--job-id {training_job_name}
--data-processing-id {training_job_name} 
--instance-type ml.c5.18xlarge
--s3-output-uri {str(s3_uri)}/training
--max-hpo-number 30
--max-hpo-parallel 3 """

In [None]:
%neptune_ml training start --wait --store-to training_results {training_params}

### Evaluating HPO Job

In this section we retrieve the results of the hyperparameter tuning job and summarize the hyperparameters of the five best training jobs along with their respective model performance.

In [None]:
tuning_job_name = training_results['hpoJob']['name']
tuner = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

In [None]:
full_df = tuner.dataframe()
df = full_df  # fall back to the unfiltered (possibly empty) frame so df.head() always works

if len(full_df) > 0:
    # Keep only jobs that have reported a finite objective value
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=False)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

# Show the five best training jobs
df.head()

We can see that the best performing training job has achieved an accuracy of ~89%. This training job will be automatically selected by Neptune ML for creating an endpoint in the next step.

## Endpoint Creation
The final machine learning step is to create an inference endpoint, which is an Amazon SageMaker endpoint instance launched with the model artifacts produced by the best training job. This endpoint will be used by our graph queries to return the model predictions for the inputs in the request. Once the endpoint is created, it stays active until it is manually deleted.

In [None]:
endpoint_params=f"""
--id {training_job_name}
--model-training-job-id {training_job_name} """

In [None]:
%neptune_ml endpoint create --wait --store-to endpoint_results {endpoint_params}

Once this has completed we get the endpoint name for our newly created inference endpoint. The cell below will set the endpoint name which will be used in the Gremlin queries below.

In [None]:
endpoint=endpoint_results['endpoint']['name']

## Predicting Values using Gremlin Queries
Now that we have our inference endpoint set up, let's query our graph to see how the model predicts `news_type` for our new `news` nodes:

In [None]:
%%gremlin
g.with("Neptune#ml.endpoint", "${endpoint}").
 V().has('news_title', "Jeb Bush to lecture at Harvard this fall").
 properties("news_type").with("Neptune#ml.classification").value()

In [None]:
%%gremlin
g.with("Neptune#ml.endpoint", "${endpoint}").
 V().has('news_title', "BREAKING: Steps to FORCE FBI Director Comey to Resign In Process – Hearing Decides His Fate Sept 28")
 .properties("news_type").with("Neptune#ml.classification").value()

We see that the model correctly predicts `news_type` for both test nodes!
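
With more held-out nodes, the comparison between predictions and the original labels can be scored rather than checked by eye. The sketch below is a minimal illustration: `accuracy` is a hypothetical helper, and the ids and labels are placeholders, not the real values from the dataset.

```python
def accuracy(predicted, actual):
    """Fraction of nodes in `actual` whose predicted label matches."""
    if not actual:
        raise ValueError("no ground-truth labels to compare against")
    hits = sum(1 for node, label in actual.items() if predicted.get(node) == label)
    return hits / len(actual)


# Placeholder ids and labels, not the real values from the dataset.
actual = {"test-node-1": "real", "test-node-2": "fake"}
predicted = {"test-node-1": "real", "test-node-2": "fake"}
print(accuracy(predicted, actual))  # 1.0
```

Two test nodes are enough to demonstrate the workflow, but a larger held-out set would be needed to estimate how well the model generalizes.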

## Cleaning Up

Now we can delete the inference endpoint to avoid recurring costs.

In [None]:
neptune_ml.delete_endpoint(training_job_name)