{ "cells": [ { "cell_type": "markdown", "id": "08858146", "metadata": {}, "source": [ "# Create Graph Dataset" ] }, { "cell_type": "markdown", "id": "d2c5861d", "metadata": {}, "source": [ "In this notebook, we put [BuzzFeed dataset](https://github.com/KaiDMML/FakeNewsNet/tree/old-version/Data/BuzzFeed) from the 2018 version of FakeNewsNet into a format that can be loaded to a Neptune cluster. To get the raw data, you can:\n", "1. Clone the [FakeNewsNet repository](https://github.com/KaiDMML/FakeNewsNet) from GitHub\n", "2. Checkout the old-version branch\n", "3. Change directory to Data/BuzzFeed\n", "\n", "Once we have created `nodes` and `edges` csv files that are compatible with Amazon Neptune, we upload them to a staging S3 bucket and then to our Neptune database." ] }, { "cell_type": "markdown", "id": "912ff2c5", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "1de76750", "metadata": {}, "outputs": [], "source": [ "# import required libraries\n", "import pandas as pd\n", "import numpy as np\n", "import scipy.io\n", "import json\n", "import numpy as np\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "import boto3\n", "import sagemaker\n", "import utils.neptune_ml_utils as neptune_ml" ] }, { "cell_type": "markdown", "id": "2c7b7b9b", "metadata": {}, "source": [ "## Read Data" ] }, { "cell_type": "markdown", "id": "582c0878", "metadata": {}, "source": [ "This notebook assumes BuzzFeed data from the 2018 version of FakeNewsNet are located under `./Data/BuzzFeed/` relative to this notebook." ] }, { "cell_type": "code", "execution_count": null, "id": "241a35c8", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "REPO=$(pwd)\n", "cd ../\n", "git clone https://github.com/KaiDMML/FakeNewsNet\n", "cd FakeNewsNet\n", "git checkout old-version\n", "cd $REPO\n", "cp -r ../FakeNewsNet/Data/* ./Data/" ] }, { "cell_type": "code", "execution_count": null, "id": "799c7429", "metadata": {}, "outputs": [], "source": [ "# read raw data for users \n", "users = pd.read_csv('./Data/BuzzFeed/User.txt', header=None)" ] }, { "cell_type": "code", "execution_count": null, "id": "d226af28", "metadata": {}, "outputs": [], "source": [ "users.head()" ] }, { "cell_type": "markdown", "id": "14ef2c9d", "metadata": {}, "source": [ "Each row in the above DataFrame provides a UIID for the corresponding user in the dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "73047ca5", "metadata": {}, "outputs": [], "source": [ "users.shape" ] }, { "cell_type": "markdown", "id": "7af69c77", "metadata": {}, "source": [ "We have a total of 15,257 users in this dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "4551a65e", "metadata": {}, "outputs": [], "source": [ "# read raw data for news \n", "news = pd.read_csv('./Data/BuzzFeed/News.txt', header=None)" ] }, { "cell_type": "code", "execution_count": null, "id": "e7e3c132", "metadata": {}, "outputs": [], "source": [ "news.head()" ] }, { "cell_type": "markdown", "id": "a7b28363", "metadata": {}, "source": [ "Each row in the above DataFrame provides a name and Id for the corresponding news in the dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "3832f5df", "metadata": {}, "outputs": [], "source": [ "news.shape" ] }, { "cell_type": "markdown", "id": "73f991ab", "metadata": {}, "source": [ "We have a total of 182 news in this dataset!" 
] }, { "cell_type": "code", "execution_count": null, "id": "faf33dc5", "metadata": {}, "outputs": [], "source": [ "# read data about news_user relationships\n", "news_user = pd.read_csv('./Data/BuzzFeed/BuzzFeedNewsUser.txt', sep='\\t', header=None)" ] }, { "cell_type": "code", "execution_count": null, "id": "908721bb", "metadata": {}, "outputs": [], "source": [ "news_user.head()" ] }, { "cell_type": "markdown", "id": "802b8c43", "metadata": {}, "source": [ "In the above DataFrame, the news_id in the first column is posted/spreaded by the user_id in the second column n times, where n is the value in the third column!" ] }, { "cell_type": "code", "execution_count": null, "id": "0e22ad0a", "metadata": {}, "outputs": [], "source": [ "news_user.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "2ad7682e", "metadata": {}, "outputs": [], "source": [ "news_user[2].sum()" ] }, { "cell_type": "markdown", "id": "b42e8980", "metadata": {}, "source": [ "There are 22,779 unique news_user relationships and a total of 25,240 news_user relationships (accounting for users that have spread a news more than once) in the dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "e020eaef", "metadata": {}, "outputs": [], "source": [ "# read data about user_user relationships\n", "user_user = pd.read_csv('./Data/BuzzFeed/BuzzFeedUserUser.txt', sep='\\t', header=None)" ] }, { "cell_type": "code", "execution_count": null, "id": "c216feaa", "metadata": {}, "outputs": [], "source": [ "user_user.head()" ] }, { "cell_type": "markdown", "id": "0af22d32", "metadata": {}, "source": [ "In the above DataFrame, user_id in the first column follows the user_id in the second column." ] }, { "cell_type": "code", "execution_count": null, "id": "8369a815", "metadata": {}, "outputs": [], "source": [ "user_user.shape" ] }, { "cell_type": "markdown", "id": "74c7a92f", "metadata": {}, "source": [ "There are a total of 634,750 user_user relationships (i.e. social links) in the dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "d0a7cbe2", "metadata": {}, "outputs": [], "source": [ "# read raw data about user features\n", "user_features = scipy.io.loadmat('./Data/BuzzFeed/UserFeature.mat')['X'].toarray()" ] }, { "cell_type": "code", "execution_count": null, "id": "0adde73e", "metadata": {}, "outputs": [], "source": [ "user_features.shape" ] }, { "cell_type": "markdown", "id": "f13edf5c", "metadata": {}, "source": [ "There are 109,626 features for each user! We will reduce dimentionality of the user features using PCA." ] }, { "cell_type": "code", "execution_count": null, "id": "ec80369a", "metadata": {}, "outputs": [], "source": [ "# reduce dimentionality of user_features using PCA\n", "X = user_features\n", "n = 100 # number of PCs\n", "pca = PCA(n_components = n)\n", "X_pca = pca.fit_transform(X)" ] }, { "cell_type": "markdown", "id": "9e44eae3", "metadata": {}, "source": [ "## Create Nodes Table" ] }, { "cell_type": "markdown", "id": "906c4299", "metadata": {}, "source": [ "In this section we create a DataFrame that will define nodes and their properties in the graph, in a format that is compatible with Amazon Neptune (with Apache TinkerPop Gremlin)." 
] }, { "cell_type": "code", "execution_count": null, "id": "2812b9c6", "metadata": {}, "outputs": [], "source": [ "# create ~id and ~label for user nodes\n", "users['row_num'] = users.index\n", "users['~id'] = users.apply(lambda x: 'user_'+str(x['row_num']+1), axis=1)\n", "users['~label'] = 'user'\n", "# add user_features as a property for each user node\n", "users['user_features:Double[]'] = np.nan\n", "for i, r in users.iterrows():\n", " string = \";\".join([format(val,'.53f') for val in X_pca[i,:]])\n", " users.loc[i, 'user_features:Double[]'] = string" ] }, { "cell_type": "code", "execution_count": null, "id": "75dd7329", "metadata": {}, "outputs": [], "source": [ "users.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "5be61eca", "metadata": {}, "outputs": [], "source": [ "# create ~id and ~label for news nodes\n", "news['row_num'] = news.index\n", "news['~id'] = news.apply(lambda x: 'news_'+str(x['row_num']+1), axis=1)\n", "news['~label'] = 'news'\n", "# specify news_type as a property for news nodes\n", "news['news_type:String'] = news.apply(lambda x: x[0].split('_')[1], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "ba21205d", "metadata": {}, "outputs": [], "source": [ "news.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "75653980", "metadata": {}, "outputs": [], "source": [ "news.tail()" ] }, { "cell_type": "code", "execution_count": null, "id": "1714e9d8", "metadata": {}, "outputs": [], "source": [ "# list of supposedly-authors appearing in the dataset who are not actually authors\n", "# we will filter them out when creating author nodes from NewsContent data\n", "non_authors = ['View All Posts', 'Cnn National Politics Reporter', 'Cnn White House Producer',\n", " 'Senior Political Reporter', 'Cnn Pentagon Correspondent', 'Cnn Senior Congressional Producer']" ] }, { "cell_type": "code", "execution_count": null, "id": "548904d2", "metadata": {}, "outputs": [], "source": [ "# initialize news_title column in news dataframe with null values\n", "news['news_title:String'] = np.nan" ] }, { "cell_type": "code", "execution_count": null, "id": "03adc4e9", "metadata": {}, "outputs": [], "source": [ "# extract list of authors and publishers from NewsContent files (i.e. 
"authors_list = []\n", "publishers_list = []\n", "for nwz in news[0]:\n", "\n", "    if nwz.split('_')[1]=='Real':\n", "        path = './Data/BuzzFeed/RealNewsContent/'+nwz+'-Webpage.json'\n", "    else:\n", "        path = './Data/BuzzFeed/FakeNewsContent/'+nwz+'-Webpage.json'\n", "\n", "    with open(path) as fp:\n", "\n", "        webpage = json.load(fp)\n", "\n", "        if 'title' in webpage:\n", "            news_title = webpage.get('title')\n", "            # populate news_title column in news dataframe\n", "            news.loc[news[0]==nwz, 'news_title:String'] = news_title\n", "\n", "        if 'source' in webpage:\n", "            publisher = webpage.get('source')\n", "            if publisher not in publishers_list:\n", "                publishers_list.append(publisher)\n", "\n", "        if 'authors' in webpage:\n", "            for author in webpage.get('authors'):\n", "                if author not in authors_list and author not in non_authors:\n", "                    authors_list.append(author)" ] }, { "cell_type": "code", "execution_count": null, "id": "fc1fa0f8", "metadata": {}, "outputs": [], "source": [ "news.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "ab229dab", "metadata": {}, "outputs": [], "source": [ "len(publishers_list)" ] }, { "cell_type": "markdown", "id": "d60aa393", "metadata": {}, "source": [ "There are 28 publishers in the dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "d7eded5f", "metadata": {}, "outputs": [], "source": [ "publishers_list[:5]" ] }, { "cell_type": "code", "execution_count": null, "id": "56927ad1", "metadata": {}, "outputs": [], "source": [ "len(authors_list)" ] }, { "cell_type": "markdown", "id": "845fed26", "metadata": {}, "source": [ "There are 126 authors in the dataset!" ] }, { "cell_type": "code", "execution_count": null, "id": "ed2df2fe", "metadata": {}, "outputs": [], "source": [ "authors_list[:5]" ] }, { "cell_type": "code", "execution_count": null, "id": "e0d01424", "metadata": {}, "outputs": [], "source": [ "# extract author_publisher, author_news and publisher_news relationships\n", "# from NewsContent files (i.e. author_publisher, author_news and publisher_news edges)\n",
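"# indices here are 1-based so they line up with the '~id' naming convention of the node tables\n",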
"author_publisher = []\n", "author_news = []\n", "publisher_news = []\n", "\n", "for news_id, nwz in enumerate(news[0]):\n", "\n", "    if nwz.split('_')[1]=='Real':\n", "        path = './Data/BuzzFeed/RealNewsContent/'+nwz+'-Webpage.json'\n", "    else:\n", "        path = './Data/BuzzFeed/FakeNewsContent/'+nwz+'-Webpage.json'\n", "\n", "    # reset per article so a missing 'source' cannot reuse the previous article's publisher\n", "    publisher_id = None\n", "\n", "    with open(path) as fp:\n", "\n", "        webpage = json.load(fp)\n", "\n", "        if 'source' in webpage:\n", "            publisher = webpage.get('source')\n", "            publisher_id = publishers_list.index(publisher)\n", "            # publisher ==> \"published\" ==> news\n", "            publisher_news.append((publisher_id+1, news_id+1))\n", "\n", "        if 'authors' in webpage:\n", "            for author in webpage.get('authors'):\n", "                if author not in non_authors:\n", "                    author_id = authors_list.index(author)\n", "                    if publisher_id is not None:\n", "                        # author ==> \"wrote_for\" ==> publisher\n", "                        author_publisher.append((author_id+1, publisher_id+1))\n", "                    # author ==> \"wrote\" ==> news\n", "                    author_news.append((author_id+1, news_id+1))" ] }, { "cell_type": "code", "execution_count": null, "id": "e5820205", "metadata": {}, "outputs": [], "source": [ "# create dataframe for author nodes\n", "authors_df = pd.DataFrame(authors_list)\n", "authors_df['row_num'] = authors_df.index\n", "authors_df['~id'] = authors_df.apply(lambda x: 'author_'+str(x['row_num']+1), axis=1)\n", "authors_df['~label'] = 'author'\n", "authors_df['author_name:String'] = authors_df[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "cde037ae", "metadata": {}, "outputs": [], "source": [ "authors_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "af616792", "metadata": {}, "outputs": [], "source": [ "# create dataframe for publisher nodes\n", "publishers_df = pd.DataFrame(publishers_list)\n", "publishers_df['row_num'] = publishers_df.index\n", "publishers_df['~id'] = publishers_df.apply(lambda x: 'publisher_'+str(x['row_num']+1), axis=1)\n", "publishers_df['~label'] = 'publisher'\n", "publishers_df['publisher_website:String'] = publishers_df[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "eac27cce", "metadata": {}, "outputs": [], "source": [ "publishers_df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "598f1ae2", "metadata": {}, "outputs": [], "source": [ "# concatenate all nodes dataframes to create an overall nodes (i.e. vertices) dataframe\n", "nodes = pd.concat([news, users, publishers_df, authors_df], sort=True, ignore_index=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "ea645938", "metadata": {}, "outputs": [], "source": [ "# drop the raw name column and the helper row_num column\n", "nodes = nodes.drop(columns=[0, 'row_num'])" ] }, { "cell_type": "code", "execution_count": null, "id": "fe5c6a1d", "metadata": {}, "outputs": [], "source": [ "nodes.shape" ] }, { "cell_type": "markdown", "id": "bec946e8", "metadata": {}, "source": [ "We have a total of 15,593 nodes in the graph!"
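] }, { "cell_type": "markdown", "id": "e4c7a2f1", "metadata": {}, "source": [ "As a quick check, the per-label counts should add up to the figures we saw earlier: 15,257 users + 182 news + 126 authors + 28 publishers = 15,593 nodes." ] }, { "cell_type": "code", "execution_count": null, "id": "f2b9d6a3", "metadata": {}, "outputs": [], "source": [ "# node counts per label; these should sum to 15,593\n", "nodes['~label'].value_counts()"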
] }, { "cell_type": "code", "execution_count": null, "id": "c6bd77f7", "metadata": {}, "outputs": [], "source": [ "# user nodes\n", "nodes.loc[nodes['~label']=='user'].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "7630693a", "metadata": {}, "outputs": [], "source": [ "# news nodes\n", "nodes.loc[nodes['~label']=='news'].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "8cacdae6", "metadata": {}, "outputs": [], "source": [ "# publisher nodes\n", "nodes.loc[nodes['~label']=='publisher'].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "dc2ebb8e", "metadata": {}, "outputs": [], "source": [ "# author nodes\n", "nodes.loc[nodes['~label']=='author'].head()" ] }, { "cell_type": "markdown", "id": "df3729cc", "metadata": {}, "source": [ "## Create Edges Table" ] }, { "cell_type": "code", "execution_count": null, "id": "925052c3", "metadata": {}, "outputs": [], "source": [ "# create a list of edges from all edge types including edge labels \n", "edges_list = []\n", "\n", "for i, r in user_user.iterrows():\n", " edges_list.append(('user_user_'+str(i+1), 'user_'+str(r[0]), 'user_'+str(r[1]), 'follows', np.nan))\n", " \n", "for i, r in news_user.iterrows():\n", " edges_list.append(('news_user_'+str(i+1), 'news_'+str(r[0]), 'user_'+str(r[1]), 'spread_by', r[2]))\n", " \n", "for i, item in enumerate(author_news):\n", " edges_list.append(('author_news_'+str(i+1), 'author_'+str(item[0]), 'news_'+str(item[1]), 'wrote', np.nan))\n", " \n", "for i, item in enumerate(publisher_news):\n", " edges_list.append(('publisher_news_'+str(i+1), 'publisher_'+str(item[0]), 'news_'+str(item[1]), 'published', np.nan))\n", " \n", "for i, item in enumerate(author_publisher):\n", " edges_list.append(('author_publisher_'+str(i+1), 'author_'+str(item[0]), 'publisher_'+str(item[1]), 'wrote_for', np.nan))" ] }, { "cell_type": "code", "execution_count": null, "id": "ee4e2198", "metadata": {}, "outputs": [], "source": [ "# convert edges_list to a dataframe\n", "edges = pd.DataFrame(edges_list, columns=['~id', '~from', '~to', '~label', 'weight:Int'])" ] }, { "cell_type": "code", "execution_count": null, "id": "d37a555c", "metadata": {}, "outputs": [], "source": [ "edges.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d2eaa332", "metadata": {}, "outputs": [], "source": [ "edges.loc[edges['~label']=='spread_by'].head()" ] }, { "cell_type": "code", "execution_count": null, "id": "dc6c9af2", "metadata": {}, "outputs": [], "source": [ "edges.shape" ] }, { "cell_type": "markdown", "id": "de70fe13", "metadata": {}, "source": [ "We have a total of 658,203 edges across all edge types!" 
] }, { "cell_type": "markdown", "id": "35930adb", "metadata": {}, "source": [ "## Save Nodes and Edges to File" ] }, { "cell_type": "code", "execution_count": null, "id": "25145af6", "metadata": {}, "outputs": [], "source": [ "!mkdir -p ./Data/upload" ] }, { "cell_type": "code", "execution_count": null, "id": "e0d28cc3", "metadata": {}, "outputs": [], "source": [ "\n", " \n", "nodes['user_features:Double[]'] = nodes['user_features:Double[]'].fillna(\"\")\n", "edges['weight:Int'] = edges['weight:Int'].fillna(1).apply(lambda gi:str(int(gi)))" ] }, { "cell_type": "code", "execution_count": null, "id": "b064eab6", "metadata": {}, "outputs": [], "source": [ "nodes.to_csv('./Data/upload/nodes.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "37d646a4", "metadata": {}, "outputs": [], "source": [ "edges.to_csv('./Data/upload/edges.csv', index=False)" ] }, { "cell_type": "markdown", "id": "9ac84185", "metadata": {}, "source": [ "## Upload to S3 Bucket" ] }, { "cell_type": "code", "execution_count": null, "id": "53b32fee", "metadata": {}, "outputs": [], "source": [ "sess = sagemaker.Session()\n", "bucket = sess.default_bucket()\n", "bucket = bucket #''\n", "prefix = 'fake-news-detection/data'\n", "s3_client = boto3.client('s3')" ] }, { "cell_type": "code", "execution_count": null, "id": "5f34b86f", "metadata": {}, "outputs": [], "source": [ "resp = s3_client.upload_file('./Data/upload/nodes.csv', bucket, f\"{prefix}/nodes.csv\")\n", "resp = s3_client.upload_file('./Data/upload/edges.csv', bucket, f\"{prefix}/edges.csv\")" ] }, { "cell_type": "code", "execution_count": null, "id": "75f5c5b8", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }