{ "cells": [ { "cell_type": "markdown", "id": "046e11b8", "metadata": {}, "source": [ "# Neptune ML and Embedding Generation\n", "This Notebook is a complete walk through of using neptune graph embeddings to create movie recommendations for IMDb Box Office and Mojo dataset.\n", "\n", "# Prequisites\n", "The code below requires some pre-requisite steps like creating Amazon Neptune Cluster and setting up NeptuneML with necessary functions, roles and job. To create the stack, please use the [Amazon Neptune Starter Template](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-quick-start.html). In addition, if you are not creating a SageMaker notebook instance from the Neptune console, please check the [graph notebook github](https://github.com/aws/graph-notebook) on installing the graph notebook library and adding your cluster information to `%%graph_notebook_config`" ] }, { "cell_type": "code", "execution_count": null, "id": "c8a034b7", "metadata": {}, "outputs": [], "source": [ "!pip install tqdm" ] }, { "cell_type": "code", "execution_count": null, "id": "39ac5334", "metadata": {}, "outputs": [], "source": [ "import neptune_ml_utils as neptune_ml\n", "import pandas as pd\n", "import json\n", "import numpy as np\n", "import os\n", "import requests\n", "import boto3\n", "import io\n", "import pickle\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "id": "c5d26f08", "metadata": {}, "source": [ "# Set your necessary input varaibles\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4a7d1e75", "metadata": {}, "outputs": [], "source": [ "# name of s3 bucket\n", "s3_bucket_uri = \"\" \n", "\n", "# s3 location where you want your export results stored\n", "processed_folder = f\"s3://{s3_bucket_uri}/experiments/neptune-export/\"" ] }, { "cell_type": "code", "execution_count": null, "id": "bb8517cf", "metadata": {}, "outputs": [], "source": [ "# remove trailing slashes\n", "s3_bucket_uri = s3_bucket_uri[:-1] if s3_bucket_uri.endswith(\"/\") else s3_bucket_uri" ] }, { "cell_type": "markdown", "id": "46ab4a1b", "metadata": {}, "source": [ "### Connect your export service to this cluster's export job\n", "\n", "Replace the URI below with your **NeptuneExportApiUri** from the template. E.g. If the URI is `https://********.execute-api.us-west-2.amazonaws.com/v1/neptune-export` use only `**********.execute-api.us-west-2.amazonaws.com/v1` for the URI below. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "2b6fd058", "metadata": {}, "outputs": [], "source": [ "# export uri\n", "expo = \"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "e6077ec0", "metadata": {}, "outputs": [], "source": [ "neptune_ml.check_ml_enabled()" ] }, { "cell_type": "code", "execution_count": null, "id": "7069e6a5", "metadata": {}, "outputs": [], "source": [ "export_params = {\n", " \"command\": \"export-pg\",\n", " \"params\": {\n", " \"endpoint\": neptune_ml.get_host(),\n", " \"profile\": \"neptune_ml\",\n", " \"cloneCluster\": True,\n", " },\n", " \"outputS3Path\": processed_folder,\n", " \"additionalParams\": {\"neptune_ml\": {\"version\": \"v2.0\"}},\n", " \"jobSize\": \"medium\",\n", "}" ] }, { "cell_type": "markdown", "id": "8538f828", "metadata": {}, "source": [ "## Create export job\n", "Creates an export job that will export the graph from Amazon Neptune to Amazon S3." ] }, { "cell_type": "code", "execution_count": null, "id": "e1efb698", "metadata": {}, "outputs": [], "source": [ "%%neptune_ml export start --export-url {expo} --export-iam --store-to export_results --wait-timeout 1000000\n", "${export_params}" ] }, { "cell_type": "code", "execution_count": null, "id": "f730af70", "metadata": {}, "outputs": [], "source": [ "%neptune_ml export status --export-url {expo} --export-iam --job-id {export_results['jobId']} --store-to export_results" ] }, { "cell_type": "markdown", "id": "d3dfc4bb", "metadata": {}, "source": [ "## Set the location of the processed results" ] }, { "cell_type": "code", "execution_count": null, "id": "af5445d9", "metadata": {}, "outputs": [], "source": [ "export_results['processed_location'] = processed_folder" ] }, { "cell_type": "markdown", "id": "49b02c49", "metadata": {}, "source": [ "## Data Processing\n", "The export job includes `training-data-configuration.json`. Use this file to add or remove any nodes or edges that you dont want to provide for training. E.g. if you want to predict the link between two nodes, you can remove that link in this configuration file. For more information, see [Editing training configuration file](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning-processing-training-config-file.html)" ] }, { "cell_type": "code", "execution_count": null, "id": "96babf75", "metadata": {}, "outputs": [], "source": [ "!aws s3 cp {export_results['processed_location']} . --recursive" ] }, { "cell_type": "code", "execution_count": null, "id": "f3d710d7", "metadata": {}, "outputs": [], "source": [ "folder = sorted([file if file.split(\"_\")[0].isnumeric() else \"local\" for file in sorted(os.listdir(os.getcwd()))])[0]\n", "export_results['processed_location'] = export_results['processed_location']+folder" ] }, { "cell_type": "markdown", "id": "6a4377b6", "metadata": {}, "source": [ "*Optional* Make edits and re-upload the configuration files" ] }, { "cell_type": "code", "execution_count": null, "id": "36941848", "metadata": {}, "outputs": [], "source": [ "!aws s3 cp {folder}/training-data-configuration.json {export_results['processed_location']}/training-data-configuration.json" ] }, { "cell_type": "markdown", "id": "3dc6ce48", "metadata": {}, "source": [ "## Create Data Processing Job\n", "You made need to increase the limit if you run into ResourceLimitExceeded (Go to Service Quotas)" ] }, { "cell_type": "code", "execution_count": null, "id": "016a52a4", "metadata": {}, "outputs": [], "source": [ "job_name = neptune_ml.get_training_job_name(\"link-pred\")\n", "processing_params = f\"\"\"--config-file-name training-data-configuration.json \\\n", "--job-id {job_name}-DP \\\n", "--s3-input-uri {export_results['outputS3Uri']} \\\n", "--s3-processed-uri {export_results['processed_location']} \\\n", "--model-type kge \\\n", "--instance-type ml.m5.2xlarge\n", "\"\"\"\n", "\n", "%neptune_ml dataprocessing start --store-to processing_results {processing_params}" ] }, { "cell_type": "code", "execution_count": null, "id": "c18dae47", "metadata": {}, "outputs": [], "source": [ "%neptune_ml dataprocessing status --job-id {processing_results['id']} --store-to processing_results" ] }, { "cell_type": "code", "execution_count": null, "id": "9a0e5933", "metadata": {}, "outputs": [], "source": [ "dp_id = processing_results[\"id\"]" ] }, { "cell_type": "markdown", "id": "d0018b0f", "metadata": {}, "source": [ "## Submit a training job\n", "You made need to increase the limit if you run into ResourceLimitExceeded (Go to Service Quotas)" ] }, { "cell_type": "code", "execution_count": null, "id": "1c7f88b0", "metadata": {}, "outputs": [], "source": [ "training_job_name = dp_id + \"training\"\n", "training_job_name = \"\".join(training_job_name.split(\"-\"))\n", "training_params = f\"--job-id train-{training_job_name} \\\n", "--data-processing-id {dp_id} \\\n", "--instance-type ml.m5.24xlarge \\\n", "--s3-output-uri s3://{str(s3_bucket_uri)}/training/{training_job_name}/\"\n", "%neptune_ml training start --store-to training_results {training_params}\n", "print(training_results)" ] }, { "cell_type": "code", "execution_count": null, "id": "bbe40eb5", "metadata": {}, "outputs": [], "source": [ "%neptune_ml training status --job-id {training_results['id']} --store-to training_status_results" ] }, { "cell_type": "markdown", "id": "f46394a4", "metadata": {}, "source": [ "# Download Embeddings" ] }, { "cell_type": "markdown", "id": "1ca93a56", "metadata": {}, "source": [ "## Mapping Embeddings to Original Node Ids" ] }, { "cell_type": "code", "execution_count": null, "id": "d3bd0b9e", "metadata": {}, "outputs": [], "source": [ "# get output job location using job name\n", "\n", "neptune_ml.get_embeddings(training_status_results[\"id\"])\n", "neptune_ml.get_mapping(training_status_results[\"id\"])\n", "\n", "f = open(\n", " \"/home/ec2-user/SageMaker/model-artifacts/\" + training_status_results[\"id\"] + \"/mapping.info\",\n", " \"rb\",\n", ")\n", "mapping = pickle.load(f)\n", "\n", "node2id = mapping[\"node2id\"]\n", "localid2globalid = mapping[\"node2gid\"]\n", "data = np.load(\n", " \"/home/ec2-user/SageMaker/model-artifacts/\" + training_status_results[\"id\"] + \"/embeddings/entity.npy\"\n", ")\n", "\n", "embd_to_sum = mapping[\"node2id\"]\n", "full = len(list(embd_to_sum[\"movie\"].keys()))\n", "ITEM_ID = []\n", "KEY = []\n", "VALUE = []\n", "for ii in tqdm(range(full)):\n", " node_id = list(embd_to_sum[\"movie\"].keys())[ii]\n", " index = localid2globalid[\"movie\"][node2id[\"movie\"][node_id]]\n", " embedding = data[index]\n", " ITEM_ID += [node_id] * embedding.shape[0]\n", " KEY += [i for i in range(embedding.shape[0])]\n", " VALUE += list(embedding)\n", "\n", "meta_df = pd.DataFrame({\"ITEM_ID\": ITEM_ID, \"KEY\": KEY, \"VALUE\": VALUE})\n", "meta_df.to_csv(\"new_embeddings.csv\")" ] }, { "cell_type": "markdown", "id": "866000f1", "metadata": {}, "source": [ "### Upload embeddings" ] }, { "cell_type": "code", "execution_count": null, "id": "f964f2ca", "metadata": {}, "outputs": [], "source": [ "s3_destination = \"s3://\"+s3_bucket_uri+\"/embeddings/\"+\"new_embeddings.csv\"" ] }, { "cell_type": "code", "execution_count": null, "id": "8f8848ab", "metadata": {}, "outputs": [], "source": [ "!aws s3 cp new_embeddings.csv {s3_destination}" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }