# Neptune ML and Embedding Generation
This notebook is a complete walkthrough of using Neptune ML graph embeddings to create movie recommendations for the IMDb and Box Office Mojo dataset.

# Prerequisites
The code below requires some prerequisite steps, such as creating an Amazon Neptune cluster and setting up Neptune ML with the necessary functions, roles, and jobs. To create the stack, use the [Amazon Neptune Starter Template](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-quick-start.html). In addition, if you are not creating a SageMaker notebook instance from the Neptune console, see the [graph-notebook GitHub repository](https://github.com/aws/graph-notebook) for instructions on installing the graph notebook library and adding your cluster information with `%%graph_notebook_config`.

In [None]:
!pip install tqdm

In [None]:
import neptune_ml_utils as neptune_ml
import pandas as pd
import json
import numpy as np
import os
import requests
import boto3
import io
import pickle
from tqdm import tqdm

# Set the required input variables


In [None]:
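# name of the S3 bucket (bucket name only, without the "s3://" prefix)
s3_bucket_uri = ""

# strip any trailing slash before the bucket name is used to build paths
s3_bucket_uri = s3_bucket_uri.rstrip("/")

# S3 location where the export results will be stored
processed_folder = f"s3://{s3_bucket_uri}/experiments/neptune-export/"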

In [None]:
# remove trailing slashes
s3_bucket_uri = s3_bucket_uri[:-1] if s3_bucket_uri.endswith("/") else s3_bucket_uri

### Connect your export service to this cluster's export job

Replace the URI below with your **NeptuneExportApiUri** from the template, without the `https://` scheme and without the trailing `/neptune-export`. For example, if the URI is `https://********.execute-api.us-west-2.amazonaws.com/v1/neptune-export`, use only `********.execute-api.us-west-2.amazonaws.com/v1`.
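
The trimming described above can be sketched in plain Python. The helper name `export_host_from_uri` is hypothetical (it is not part of `neptune_ml_utils`), the example URI is a placeholder, and the sketch assumes the template output ends in `/neptune-export`.

```python
from urllib.parse import urlparse


def export_host_from_uri(full_uri: str) -> str:
    """Return "<host>/<stage>" without the scheme or the trailing /neptune-export.

    Hypothetical helper for illustration; not part of neptune_ml_utils.
    """
    parsed = urlparse(full_uri.strip())
    # Join host and path so the function also works when no scheme is given.
    host_and_path = (parsed.netloc + parsed.path).rstrip("/")
    suffix = "/neptune-export"
    if host_and_path.endswith(suffix):
        host_and_path = host_and_path[: -len(suffix)]
    return host_and_path


# Placeholder URI in the same shape as the template output
example = "https://abc123.execute-api.us-west-2.amazonaws.com/v1/neptune-export"
print(export_host_from_uri(example))
```

The returned string is the value that would be assigned to `expo` in the next cell.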


In [None]:
# export uri
expo = ""

In [None]:
neptune_ml.check_ml_enabled()

In [None]:
export_params = {
    "command": "export-pg",
    "params": {
        "endpoint": neptune_ml.get_host(),
        "profile": "neptune_ml",
        "cloneCluster": True,
    },
    "outputS3Path": processed_folder,
    "additionalParams": {"neptune_ml": {"version": "v2.0"}},
    "jobSize": "medium",
}

## Create export job
Creates an export job that will export the graph from Amazon Neptune to Amazon S3.

In [None]:
%%neptune_ml export start --export-url {expo} --export-iam --store-to export_results --wait-timeout 1000000
${export_params}

In [None]:
%neptune_ml export status --export-url {expo} --export-iam --job-id {export_results['jobId']} --store-to export_results

## Set the location of the processed results

In [None]:
export_results['processed_location'] = processed_folder

## Data Processing
The export job output includes `training-data-configuration.json`. Edit this file to add or remove any nodes or edges that you don't want to use for training. For example, if you want to predict the link between two nodes, you can remove that link in this configuration file. For more information, see [Editing training configuration file](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-processing-training-config-file.html).

In [None]:
!aws s3 cp {export_results['processed_location']} . --recursive

In [None]:
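# The downloaded export folder has a numeric prefix before the first "_";
# pick the first such entry in the working directory, or fall back to "local".
export_folders = sorted(name for name in os.listdir(os.getcwd()) if name.split("_")[0].isnumeric())
folder = export_folders[0] if export_folders else "local"
export_results['processed_location'] = export_results['processed_location'] + folder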
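
The folder selection above can be illustrated on a plain list of names. This is a minimal sketch: `pick_export_folder` is a hypothetical helper and the directory listing is made up for the example.

```python
def pick_export_folder(names):
    """Return the first numeric-prefixed name in sorted order, else "local"."""
    numeric = sorted(n for n in names if n.split("_")[0].isnumeric())
    return numeric[0] if numeric else "local"


# Hypothetical directory listing, for illustration only
listing = ["neptune_ml_utils.py", "1650000000_export", "notebook.ipynb"]
print(pick_export_folder(listing))
```

Only the part of each name before the first underscore is tested, so ordinary notebook files are ignored.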

*Optional:* Edit the configuration file and re-upload it.
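
One possible edit is removing an edge type so that it is not used for training. The sketch below works on an in-memory dictionary. The layout it assumes (edges under `graph.edges`, each with a `relation` list whose last element is the edge type) is an assumption, and `drop_edge_type` is a hypothetical helper; check both against your own `training-data-configuration.json`.

```python
import copy


def drop_edge_type(config: dict, edge_type: str) -> dict:
    """Return a copy of config without edges whose relation matches edge_type.

    ASSUMPTION: edges live under config["graph"]["edges"] and each entry has a
    "relation" list whose last element is the edge type. Verify this against
    your own training-data-configuration.json before relying on it.
    """
    edited = copy.deepcopy(config)
    graph = edited.setdefault("graph", {})
    graph["edges"] = [
        edge
        for edge in graph.get("edges", [])
        if (edge.get("relation") or [None])[-1] != edge_type
    ]
    return edited


# Toy configuration in the assumed shape (not real export output)
toy_config = {
    "graph": {
        "edges": [
            {"file_name": "edges/a.csv", "relation": ["", "rated"]},
            {"file_name": "edges/b.csv", "relation": ["", "directed"]},
        ]
    }
}
edited = drop_edge_type(toy_config, "rated")
print([edge["relation"][-1] for edge in edited["graph"]["edges"]])
```

In the notebook, the dictionary would come from `json.load` on `{folder}/training-data-configuration.json` and be written back with `json.dump` before running the upload cell.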

In [None]:
!aws s3 cp {folder}/training-data-configuration.json {export_results['processed_location']}/training-data-configuration.json

## Create Data Processing Job
You may need to request a quota increase if you run into a `ResourceLimitExceeded` error (go to Service Quotas).

In [None]:
job_name = neptune_ml.get_training_job_name("link-pred")
processing_params = f"""--config-file-name training-data-configuration.json \
--job-id {job_name}-DP \
--s3-input-uri {export_results['outputS3Uri']} \
--s3-processed-uri {export_results['processed_location']} \
--model-type kge \
--instance-type ml.m5.2xlarge
"""

%neptune_ml dataprocessing start --store-to processing_results {processing_params}

In [None]:
%neptune_ml dataprocessing status --job-id {processing_results['id']} --store-to processing_results
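
Training depends on the data processing job having finished, so the status cell above may need to be re-run until the job completes. A generic polling loop is sketched below. `wait_for_job` is a hypothetical helper, the status source is simulated, and the default state names are assumptions to be matched to what the status cell reports.

```python
import time


def wait_for_job(get_status, done_states=("Completed",),
                 failed_states=("Failed", "Stopped"),
                 interval_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes.

    get_status is any zero-argument callable returning a status string.
    The default state names are ASSUMPTIONS, not confirmed API values.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if status in failed_states:
            raise RuntimeError(f"job ended in state {status!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Simulated status source standing in for a real status call
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
result = wait_for_job(lambda: next(fake_statuses), interval_s=0.0)
print(result)
```

Passing the status source as a callable keeps the loop independent of how the status is fetched.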

In [None]:
dp_id = processing_results["id"]

## Submit a training job
You may need to request a quota increase if you run into a `ResourceLimitExceeded` error (go to Service Quotas).

In [None]:
training_job_name = dp_id + "training"
training_job_name = "".join(training_job_name.split("-"))
training_params = f"--job-id train-{training_job_name} \
--data-processing-id {dp_id} \
--instance-type ml.m5.24xlarge \
--s3-output-uri s3://{str(s3_bucket_uri)}/training/{training_job_name}/"
%neptune_ml training start --store-to training_results {training_params}
print(training_results)

In [None]:
%neptune_ml training status --job-id {training_results['id']} --store-to training_status_results

# Download Embeddings

## Mapping Embeddings to Original Node Ids

In [None]:
# fetch the embeddings and the node mapping for the training job;
# the files are then read from the local model-artifacts folder
neptune_ml.get_embeddings(training_status_results["id"])
neptune_ml.get_mapping(training_status_results["id"])

artifact_dir = os.path.join(
    "/home/ec2-user/SageMaker/model-artifacts", training_status_results["id"]
)

with open(os.path.join(artifact_dir, "mapping.info"), "rb") as f:
    mapping = pickle.load(f)

node2id = mapping["node2id"]  # original node ID -> local index, per node type
localid2globalid = mapping["node2gid"]  # local index -> row of the embedding matrix
data = np.load(os.path.join(artifact_dir, "embeddings", "entity.npy"))

# Flatten the movie embeddings into long format: one row per (movie, dimension)
movie_node2id = node2id["movie"]
ITEM_ID = []
KEY = []
VALUE = []
for node_id in tqdm(movie_node2id):
    # original node ID -> local index -> global row in the embedding matrix
    index = localid2globalid["movie"][movie_node2id[node_id]]
    embedding = data[index]
    ITEM_ID += [node_id] * embedding.shape[0]
    KEY += list(range(embedding.shape[0]))
    VALUE += list(embedding)

meta_df = pd.DataFrame({"ITEM_ID": ITEM_ID, "KEY": KEY, "VALUE": VALUE})
meta_df.to_csv("new_embeddings.csv")
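
The lookup chain used above (original node ID, then local index, then global row, then embedding) can be traced on toy data. This is a minimal sketch with made-up values and plain Python lists. In the notebook, the corresponding objects come from `mapping.info` and `entity.npy`.

```python
# Toy stand-ins for the objects loaded in the notebook (all values are made up)
node2id = {"movie": {"tt001": 0, "tt002": 1}}      # original node ID -> local index
node2gid = {"movie": [2, 0]}                        # local index -> global row
embeddings = [[0.5, 0.6], [9.0, 9.0], [0.1, 0.2]]   # one row per global index


def to_long_rows(node_type):
    """Yield (ITEM_ID, KEY, VALUE) rows: one per node and embedding dimension."""
    for node_id, local_id in node2id[node_type].items():
        row = embeddings[node2gid[node_type][local_id]]
        for dim, value in enumerate(row):
            yield node_id, dim, value


rows = list(to_long_rows("movie"))
for r in rows:
    print(r)
```

Each movie contributes as many rows as the embedding has dimensions, which matches the `ITEM_ID`, `KEY`, `VALUE` columns written to `new_embeddings.csv`.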

### Upload embeddings

In [None]:
s3_destination = "s3://"+s3_bucket_uri+"/embeddings/"+"new_embeddings.csv"

In [None]:
!aws s3 cp new_embeddings.csv {s3_destination}