# Create Graph Dataset

In this notebook, we convert the [BuzzFeed dataset](https://github.com/KaiDMML/FakeNewsNet/tree/old-version/Data/BuzzFeed) from the 2018 version of FakeNewsNet into a format that can be loaded into an Amazon Neptune cluster. To get the raw data, you can:
1. Clone the [FakeNewsNet repository](https://github.com/KaiDMML/FakeNewsNet) from GitHub
2. Check out the `old-version` branch
3. Change directory to `Data/BuzzFeed`

Once we have created `nodes` and `edges` CSV files that are compatible with Amazon Neptune, we upload them to a staging S3 bucket, from which they can be loaded into our Neptune database.

## Setup

In [None]:
# import required libraries
import json

import boto3
import numpy as np
import pandas as pd
import sagemaker
import scipy.io
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import utils.neptune_ml_utils as neptune_ml

## Read Data

This notebook assumes BuzzFeed data from the 2018 version of FakeNewsNet are located under `./Data/BuzzFeed/` relative to this notebook.

In [None]:
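%%bash

# clone FakeNewsNet next to this repository and copy the old-version data into ./Data
REPO=$(pwd)
cd ../
git clone https://github.com/KaiDMML/FakeNewsNet
cd FakeNewsNet
git checkout old-version
cd "$REPO"
# make sure the target directory exists before copying
mkdir -p ./Data
cp -r ../FakeNewsNet/Data/* ./Data/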

In [None]:
# read raw data for users 
users = pd.read_csv('./Data/BuzzFeed/User.txt', header=None)

In [None]:
users.head()

Each row in the above DataFrame provides a unique ID for the corresponding user in the dataset!

In [None]:
users.shape

We have a total of 15,257 users in this dataset!

In [None]:
# read raw data for news 
news = pd.read_csv('./Data/BuzzFeed/News.txt', header=None)

In [None]:
news.head()

Each row in the above DataFrame provides a name and ID for the corresponding news article in the dataset!

In [None]:
news.shape

We have a total of 182 news articles in this dataset!

In [None]:
# read data about news_user relationships
news_user = pd.read_csv('./Data/BuzzFeed/BuzzFeedNewsUser.txt', sep='\t', header=None)

In [None]:
news_user.head()

In the above DataFrame, the news_id in the first column was posted/spread by the user_id in the second column n times, where n is the value in the third column!

In [None]:
news_user.shape

In [None]:
news_user[2].sum()

There are 22,779 unique news_user relationships and a total of 25,240 news_user relationships (counting users who have spread the same news article more than once) in the dataset!
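
The difference between the two counts can be illustrated on a toy table. This is a minimal sketch with made-up values, not the real data: `toy_news_user` is a hypothetical stand-in that only mirrors the three-column layout of `news_user`.

```python
import pandas as pd

# Hypothetical stand-in for news_user: (news_id, user_id, times_spread).
toy_news_user = pd.DataFrame([[1, 10, 1], [1, 11, 3], [2, 10, 2]])

# One row per (news, user) pair, so the row count is the number of unique relationships.
unique_relationships = toy_news_user.shape[0]

# Summing the third column also counts repeated spreads by the same user.
total_relationships = int(toy_news_user[2].sum())

print(unique_relationships, total_relationships)  # 3 6
```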

In [None]:
# read data about user_user relationships
user_user = pd.read_csv('./Data/BuzzFeed/BuzzFeedUserUser.txt', sep='\t', header=None)

In [None]:
user_user.head()

In the above DataFrame, user_id in the first column follows the user_id in the second column.

In [None]:
user_user.shape

There are a total of 634,750 user_user relationships (i.e. social links) in the dataset!

In [None]:
# read raw data about user features
user_features = scipy.io.loadmat('./Data/BuzzFeed/UserFeature.mat')['X'].toarray()

In [None]:
user_features.shape

There are 109,626 features for each user! We will reduce the dimensionality of the user features using PCA.

In [None]:
# reduce dimensionality of user_features using PCA
X = user_features
n = 100 # number of PCs
pca = PCA(n_components = n)
X_pca = pca.fit_transform(X)
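
One way to judge whether a given number of components is reasonable is to look at how much variance the retained components explain. The sketch below does this on random toy data, since the real feature matrix is large; `X_toy` and `pca_toy` are hypothetical names, and the 10-component setting is only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy matrix standing in for the user feature matrix (assumption: dense input).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 50))

# Keep 10 principal components and project the data onto them.
pca_toy = PCA(n_components=10)
X_toy_pca = pca_toy.fit_transform(X_toy)

# Fraction of the total variance captured by the retained components (between 0 and 1).
retained = float(pca_toy.explained_variance_ratio_.sum())

print(X_toy_pca.shape, round(retained, 3))
```

On the notebook's own `pca` object, the same quantity would be `pca.explained_variance_ratio_.sum()`.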

## Create Nodes Table

In this section we create a DataFrame that will define nodes and their properties in the graph, in a format that is compatible with Amazon Neptune (with Apache TinkerPop Gremlin).
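
As a rough picture of the target layout: a Neptune Gremlin node file has the system columns `~id` and `~label`, followed by property columns named `name:Type`, where array types such as `Double[]` hold semicolon-separated values. The sketch below writes two made-up rows with the standard library; the values are illustrative only.

```python
import csv
import io

# Two hypothetical node rows; properties that do not apply to a node are left empty.
node_rows = [
    {"~id": "user_1", "~label": "user",
     "user_features:Double[]": "0.12;-0.5;3.0", "news_type:String": ""},
    {"~id": "news_1", "~label": "news",
     "user_features:Double[]": "", "news_type:String": "Real"},
]

# Write the rows as CSV into an in-memory buffer.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(node_rows[0].keys()))
writer.writeheader()
writer.writerows(node_rows)
nodes_csv = buf.getvalue()

print(nodes_csv)
```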

In [None]:
# create ~id and ~label for user nodes
users['row_num'] = users.index
users['~id'] = users.apply(lambda x: 'user_'+str(x['row_num']+1), axis=1)
users['~label'] = 'user'
# add user_features as a property for each user node:
# each PCA row becomes one semicolon-separated string for the Double[] column
# (assumes users has a default 0..n-1 index aligned with the rows of X_pca)
users['user_features:Double[]'] = [
    ";".join(format(val, '.53f') for val in row) for row in X_pca
]
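
The encoding used for the `Double[]` column can be sketched in isolation. The helper names below are hypothetical, and the default precision of 6 decimal places is an assumption chosen for readability; the notebook itself formats with 53 decimal places.

```python
def encode_double_array(values, precision=6):
    """Join floats into one semicolon-separated string with fixed decimal places."""
    return ";".join(format(v, f".{precision}f") for v in values)


def decode_double_array(cell):
    """Parse a semicolon-separated string back into floats; empty cell gives []."""
    return [float(part) for part in cell.split(";")] if cell else []


encoded = encode_double_array([0.5, -1.25, 3.0])
print(encoded)  # 0.500000;-1.250000;3.000000
print(decode_double_array(encoded))  # [0.5, -1.25, 3.0]
```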

In [None]:
users.head()

In [None]:
# create ~id and ~label for news nodes
news['row_num'] = news.index
news['~id'] = news.apply(lambda x: 'news_'+str(x['row_num']+1), axis=1)
news['~label'] = 'news'
# specify news_type as a property for news nodes
news['news_type:String'] = news.apply(lambda x: x[0].split('_')[1], axis=1)

In [None]:
news.head()

In [None]:
news.tail()

In [None]:
# list of supposedly-authors appearing in the dataset who are not actually authors
# we will filter them out when creating author nodes from NewsContent data
non_authors = ['View All Posts', 'Cnn National Politics Reporter', 'Cnn White House Producer',
 'Senior Political Reporter', 'Cnn Pentagon Correspondent', 'Cnn Senior Congressional Producer']

In [None]:
# initialize news_title column in news dataframe with null values
# (None gives an object column, so string titles can be assigned later without dtype warnings)
news['news_title:String'] = None

In [None]:
# extract list of authors and publishers from NewsContent files (i.e. authors and publishers nodes)
authors_list = []
publishers_list = []
for nwz in news[0]:

    # the news name encodes whether the article is real or fake, e.g. <prefix>_Real_<n>
    if nwz.split('_')[1]=='Real':
        path = './Data/BuzzFeed/RealNewsContent/'+nwz+'-Webpage.json'
    else:
        path = './Data/BuzzFeed/FakeNewsContent/'+nwz+'-Webpage.json'

    with open(path) as fp:
        webpage = json.load(fp)

    if 'title' in webpage:
        news_title = webpage.get('title')
        # populate news_title column in news dataframe
        news.loc[news[0]==nwz, 'news_title:String'] = news_title

    if 'source' in webpage:
        publisher = webpage.get('source')
        if publisher not in publishers_list:
            publishers_list.append(publisher)

    if 'authors' in webpage:
        for author in webpage.get('authors'):
            if author not in authors_list and author not in non_authors:
                authors_list.append(author)

In [None]:
news.head()

In [None]:
len(publishers_list)

There are 28 publishers in the dataset!

In [None]:
publishers_list[:5]

In [None]:
len(authors_list)

There are 126 authors in the dataset!

In [None]:
authors_list[:5]

In [None]:
# extract author_publisher, author_news and publisher_news relationships
# from NewsContent files (i.e. author_publisher, author_news and publisher_news edges)
author_publisher = []
author_news = []
publisher_news = []

for news_id, nwz in enumerate(news[0]):

    if nwz.split('_')[1]=='Real':
        path = './Data/BuzzFeed/RealNewsContent/'+nwz+'-Webpage.json'
    else:
        path = './Data/BuzzFeed/FakeNewsContent/'+nwz+'-Webpage.json'

    with open(path) as fp:
        webpage = json.load(fp)

    # reset per article so a missing 'source' cannot reuse the previous article's publisher
    publisher_id = None
    if 'source' in webpage:
        publisher = webpage.get('source')
        publisher_id = publishers_list.index(publisher)
        # publisher ==> "published" ==> news
        publisher_news.append((publisher_id+1, news_id+1))

    if 'authors' in webpage:
        for author in webpage.get('authors'):
            if author not in non_authors:
                author_id = authors_list.index(author)
                # author ==> "wrote_for" ==> publisher (only when the publisher is known)
                if publisher_id is not None:
                    author_publisher.append((author_id+1, publisher_id+1))
                # author ==> "wrote" ==> news
                author_news.append((author_id+1, news_id+1))

In [None]:
# create dataframe for author nodes
authors_df = pd.DataFrame(authors_list)
authors_df['row_num'] = authors_df.index
authors_df['~id'] = authors_df.apply(lambda x: 'author_'+str(x['row_num']+1), axis=1)
authors_df['~label'] = 'author'
authors_df['author_name:String'] = authors_df[0]

In [None]:
authors_df.head()

In [None]:
# create dataframe for publisher nodes
publishers_df = pd.DataFrame(publishers_list)
publishers_df['row_num'] = publishers_df.index
publishers_df['~id'] = publishers_df.apply(lambda x: 'publisher_'+str(x['row_num']+1), axis=1)
publishers_df['~label'] = 'publisher'
publishers_df['publisher_website:String'] = publishers_df[0]

In [None]:
publishers_df.head()

In [None]:
# concatenate all nodes dataframes to create an overall nodes (i.e. vertices) dataframe
nodes = pd.concat([news, users, publishers_df, authors_df], sort=True, ignore_index=True)

In [None]:
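# drop unwanted columns by name: the raw input column (0) and the helper row_num column
# (dropping by position is fragile because sort=True reorders the columns)
nodes = nodes.drop(columns=[0, 'row_num'])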

In [None]:
nodes.shape

We have a total of 15,593 nodes in the graph!

In [None]:
# user nodes
nodes.loc[nodes['~label']=='user'].head()

In [None]:
# news nodes
nodes.loc[nodes['~label']=='news'].head()

In [None]:
# publisher nodes
nodes.loc[nodes['~label']=='publisher'].head()

In [None]:
# author nodes
nodes.loc[nodes['~label']=='author'].head()

## Create Edges Table

In [None]:
# create a list of edges from all edge types including edge labels
# each tuple is (~id, ~from, ~to, ~label, weight)
edges_list = []

for i, r in user_user.iterrows():
    edges_list.append(('user_user_'+str(i+1), 'user_'+str(r[0]), 'user_'+str(r[1]), 'follows', np.nan))

for i, r in news_user.iterrows():
    edges_list.append(('news_user_'+str(i+1), 'news_'+str(r[0]), 'user_'+str(r[1]), 'spread_by', r[2]))

for i, item in enumerate(author_news):
    edges_list.append(('author_news_'+str(i+1), 'author_'+str(item[0]), 'news_'+str(item[1]), 'wrote', np.nan))

for i, item in enumerate(publisher_news):
    edges_list.append(('publisher_news_'+str(i+1), 'publisher_'+str(item[0]), 'news_'+str(item[1]), 'published', np.nan))

for i, item in enumerate(author_publisher):
    edges_list.append(('author_publisher_'+str(i+1), 'author_'+str(item[0]), 'publisher_'+str(item[1]), 'wrote_for', np.nan))

In [None]:
# convert edges_list to a dataframe
edges = pd.DataFrame(edges_list, columns=['~id', '~from', '~to', '~label', 'weight:Int'])

In [None]:
edges.head()

In [None]:
edges.loc[edges['~label']=='spread_by'].head()

In [None]:
edges.shape

We have a total of 658,203 edges across all edge types!
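
Before saving, it can be useful to check that every edge endpoint refers to a node that actually exists. The sketch below shows one way to do this on toy tables; `find_dangling_edges` is a hypothetical helper, not part of the notebook or of any library.

```python
import pandas as pd


def find_dangling_edges(nodes_df, edges_df):
    """Return the edges whose ~from or ~to value is not a known node ~id."""
    known_ids = set(nodes_df["~id"])
    missing = ~edges_df["~from"].isin(known_ids) | ~edges_df["~to"].isin(known_ids)
    return edges_df[missing]


# Toy tables: edge e2 points at user_9, which is not in the node table.
toy_nodes = pd.DataFrame({"~id": ["user_1", "user_2", "news_1"]})
toy_edges = pd.DataFrame({
    "~id": ["e1", "e2"],
    "~from": ["user_1", "news_1"],
    "~to": ["user_2", "user_9"],
    "~label": ["follows", "spread_by"],
})

dangling = find_dangling_edges(toy_nodes, toy_edges)
print(dangling["~id"].tolist())  # ['e2']
```

With the notebook's own tables, the call would be `find_dangling_edges(nodes, edges)`, and an empty result means all endpoints resolve.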

## Save Nodes and Edges to File

In [None]:
!mkdir -p ./Data/upload

In [None]:

 
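# replace missing feature vectors with empty strings
nodes['user_features:Double[]'] = nodes['user_features:Double[]'].fillna("")
# edges without an explicit weight get a default weight of 1, stored as an integer
edges['weight:Int'] = edges['weight:Int'].fillna(1).astype(int)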

In [None]:
nodes.to_csv('./Data/upload/nodes.csv', index=False)

In [None]:
edges.to_csv('./Data/upload/edges.csv', index=False)

## Upload to S3 Bucket

In [None]:
sess = sagemaker.Session()
# use the SageMaker default bucket; replace with another bucket name if needed
bucket = sess.default_bucket()
prefix = 'fake-news-detection/data'
s3_client = boto3.client('s3')

In [None]:
# upload_file returns None, so there is no response to capture
s3_client.upload_file('./Data/upload/nodes.csv', bucket, f"{prefix}/nodes.csv")
s3_client.upload_file('./Data/upload/edges.csv', bucket, f"{prefix}/edges.csv")
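
Once the files are in S3, they can be loaded with the Neptune bulk loader, which accepts a JSON request posted to the cluster's `/loader` endpoint. The sketch below only builds such a request body and makes no network call. The function name, bucket name, role ARN, and region are all hypothetical placeholders, and the notebook itself does not show how the load is triggered.

```python
import json


def build_loader_request(bucket, prefix, iam_role_arn, region):
    """Build a bulk-loader request body for CSV files under s3://bucket/prefix/."""
    return {
        "source": f"s3://{bucket}/{prefix}/",
        "format": "csv",             # Gremlin CSV load format
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read from the bucket
        "region": region,
        "failOnError": "FALSE",
    }


# Placeholder values for illustration only.
payload = build_loader_request(
    bucket="my-staging-bucket",
    prefix="fake-news-detection/data",
    iam_role_arn="arn:aws:iam::123456789012:role/HypotheticalNeptuneLoadRole",
    region="us-east-1",
)
print(json.dumps(payload, indent=2))
```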