Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Building an Identity Graph Application on Amazon Neptune

This notebook shows how Amazon Neptune can be used to build an identity graph for applications in marketing and targeted advertising using a smaller version of the dataset found in the Amazon Web Services blog post, [Building a customer identity graph with Amazon Neptune](https://aws.amazon.com/blogs/database/building-a-customer-identity-graph-with-amazon-neptune/).
 
 - [Background](#Background)
 - [Getting Started](#Getting-Started)
 - [Cross-Device Graphs](#Cross-Device-Graphs)
 - [Targeted Promotions](#Targeted-Promotions)
 - [Audience Segmentation](#Audience-Segmentation)
 - [Conclusion](#Conclusion)
 - [What's Next?](#What's-Next?)

## Background

An identity graph provides a single unified view of customers and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods. Typically, identity graphs are part of a larger identity resolution architecture. Identity resolution is the process of matching human identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes, for targeted advertising.

This notebook walks you through a sample identity graph solution and shows how it can be used within a larger identity resolution architecture, using an open dataset and a graph database, Amazon Neptune. It also includes a number of data visualizations that help you understand the structure of an identity graph and the characteristics of an identity resolution dataset and use case. Finally, it showcases some additional use cases that this particular dataset supports.



## Getting Started

The dataset used in this notebook is derived from the **CIKM Cup 2016 Track 1: Cross-Device Entity Linking Challenge**. This dataset contains anonymized clickstream data for a set of anonymized user IDs representing the same user across multiple devices, as well as hashed site URLs and HTML titles those users visited. For example, 'http://www.amazon.com' could be represented as 'c94174b63350fd53' in this dataset, and 'http://www.amazon.com/gp/bestsellers' could be represented as 'c94174b63350fd53/1e8deebfc8e36e85/2215e2a3f89eeba7', where each directory in the URL is hashed as well. Usernames, cookies, and device IDs are hashed in a similar way. To make this dataset more interesting to use, we have added additional features to this data using the Python [Faker](https://faker.readthedocs.io/en/master/) library. Combining the CIKM dataset with the manufactured data, we have stitched the data together to create the following graph data model:

[Figure: graph data model]

This data model could be used as the starting point for your own identity graph. In most cases, an organization will have data related to website visits, click events, IP addresses, and some combination of anonymous users and registered (signed-on) users. Anonymous users in this data model are represented as "transient IDs" and known users are represented as "persistent IDs". Linking anonymous users to known users is accomplished through a process called [identity (or entity) resolution](https://en.wikipedia.org/wiki/Record_linkage#Entity_resolution). We have assumed here that identity resolution has already taken place and that we are storing the relationships between the anonymous users and their resolved known users. Identity resolution can rely on deterministic rules (such as matching users coming from the same public IP address within a certain time frame), on probabilistic methods (such as machine learning), or on a combination of the two.
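To make the deterministic rule mentioned above more concrete, here is a minimal sketch of linking anonymous IDs to known users that appear on the same public IP address within a time window. It is not the method used to prepare this dataset. The function name `link_by_shared_ip`, the `(user_id, ip, timestamp)` event layout, the one-hour window, and all sample values are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def link_by_shared_ip(events, known_users, window=timedelta(hours=1)):
    """Link anonymous IDs to known users seen on the same IP within `window`.

    events: iterable of (user_id, ip, timestamp) tuples (hypothetical layout).
    known_users: set of user IDs that are registered (persistent IDs).
    Returns {anonymous_id: set of candidate known user IDs}.
    """
    # Group every sighting by the IP address it came from.
    by_ip = defaultdict(list)
    for user_id, ip, ts in events:
        by_ip[ip].append((user_id, ts))

    links = defaultdict(set)
    for sightings in by_ip.values():
        for anon_id, anon_ts in sightings:
            if anon_id in known_users:
                continue  # only anonymous IDs need resolving
            for other_id, other_ts in sightings:
                close_in_time = abs(anon_ts - other_ts) <= window
                if other_id in known_users and close_in_time:
                    links[anon_id].add(other_id)
    return dict(links)


# Hypothetical sample events (documentation-only IP ranges).
events = [
    ("anon-1", "203.0.113.7", datetime(2024, 1, 1, 10, 0)),
    ("user-A", "203.0.113.7", datetime(2024, 1, 1, 10, 20)),
    ("anon-2", "203.0.113.7", datetime(2024, 1, 1, 18, 0)),
    ("user-B", "198.51.100.4", datetime(2024, 1, 1, 10, 5)),
]
print(link_by_shared_ip(events, known_users={"user-A", "user-B"}))
```

In this toy data only `anon-1` is linked (to `user-A`), because `anon-2` shares the IP address but falls outside the time window. In the graph used in this notebook, links like these are assumed to already exist as `has_identity` edges.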

### Loading the Identity Graph

To get started with our identity graph, we will load a set of data that follows the data model shown above into our Neptune cluster. Run the following two cells to download and load this data. The first cell downloads the data from S3 into your notebook instance. The second cell uses the `%seed` command to load this data into your Neptune cluster. This step takes approximately 5-6 minutes and loads approximately 120,000 clickstream events with their corresponding user information.

***NOTE: If using this notebook in a locally installed deployment of graph-notebook, you may need to copy the contents of the first cell below, add `sudo` to the beginning of each line, and run these commands in a bash terminal window. Otherwise, you may get 'permission denied' errors. If running this notebook in a Neptune Notebook or SageMaker Notebook instance, this will work as is.***

In [None]:
%%bash

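# Locate the directory where pip installed graph-notebook
GRAPH_NOTEBOOK_INSTALL_PATH=$(pip show graph-notebook | grep "Location" | cut -d ' ' -f2)
# -p avoids an error if the directory already exists (for example, on a re-run)
mkdir -p "$GRAPH_NOTEBOOK_INSTALL_PATH/graph_notebook/seed/queries/propertygraph/gremlin/identity"
cd "$GRAPH_NOTEBOOK_INSTALL_PATH/graph_notebook/seed/queries/propertygraph/gremlin/identity"
# Download the sample data, unpack it (overwriting earlier copies), and remove the archive
curl -s "https://aws-admartech-samples.s3.amazonaws.com/identity-graph-notebook-data/idgraph.zip" > \
 ./idgraph.zip
unzip -o ./idgraph.zip
rm -f ./idgraph.zip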

In [None]:
%seed --model Property_Graph --language gremlin --dataset identity --run

### Reviewing the Dataset

Before proceeding, let's check that the dataset loaded correctly. In the following two queries, we want to count each of the vertices and edges by their corresponding vertex or edge label (the type of vertex or edge from the graph data model picture above). You should get an output that matches the following (**NOTE: you may see more than this if you have previously loaded data into your Neptune cluster.**):

For vertices:

{'persistentId': 166, 'website': 41326, 'transientId': 671, 'identityGroup': 50, 'IP': 171, 'websiteGroup': 3758}

In [None]:
%%gremlin

g.V().groupCount().by(label)

For edges:

{'member': 166, 'visited': 117808, 'links_to': 41326, 'uses': 676, 'has_identity': 671}

In [None]:
%%gremlin

g.E().groupCount().by(label)
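Because a cluster that already held data can legitimately report larger counts, a strict equality check against the numbers above is too rigid. The following sketch only flags labels whose count falls below the expected value. It assumes you copy the dictionaries printed by the two queries into Python by hand; the helper name `find_shortfalls` is hypothetical and not part of graph-notebook.

```python
# Expected counts, copied from the text above.
EXPECTED_VERTICES = {
    'persistentId': 166, 'website': 41326, 'transientId': 671,
    'identityGroup': 50, 'IP': 171, 'websiteGroup': 3758,
}
EXPECTED_EDGES = {
    'member': 166, 'visited': 117808, 'links_to': 41326,
    'uses': 676, 'has_identity': 671,
}


def find_shortfalls(expected, actual):
    """Return {label: (expected_count, actual_count)} for labels that fall short.

    Counts above the expected value are accepted, since the cluster may
    already have contained data before this dataset was loaded.
    """
    return {
        label: (count, actual.get(label, 0))
        for label, count in expected.items()
        if actual.get(label, 0) < count
    }


# Hypothetical example: a load that stopped before all 'visited' edges arrived.
partial_edges = dict(EXPECTED_EDGES, visited=100000)
print(find_shortfalls(EXPECTED_EDGES, partial_edges))
```

An empty dictionary means every expected label is present with at least the expected count.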

## Example Use Cases and Applications

In the following sections, we showcase several ways in which identity graphs are used. Many of these use cases center on better understanding user behavior as users interact with an online platform. These insights give you a means of serving customers in a targeted manner or converting potential product interest into sales opportunities.


### Cross-Device Graphs

**Advertisers want to learn about user interests in order to provide accurate targeting. The data should be based on the activity of the user across all devices.**

Suppose you are hosting a web platform and collecting clickstream data as users browse your site or use your mobile app. In the majority of situations, users of your platform will be anonymous (that is, not registered or not logged in). However, these anonymous users may be linked to known users that have used your platform before. We can join (or resolve) the identity of the anonymous user with attributes we know about existing users, and make some assumptions (based on known user behavior and heuristics) in order to learn more about this anonymous user. We can then use this information to target the user with advertising, special offers, discounts, and so on.

Let's use an example where we have an anonymous user id ('ed4982b00e323383583f30236e5b1f11'). We want to know more about this user and if they are linked to other users on our platform. This anonymous user is considered a "transient ID" in our graph data model (see picture above). Assuming this user does not have a link to a known user, or "persistent ID", how might we find connections from this transient ID to other known user IDs? Looking at the data model, you can see that "transient ID" vertices in our graph are connected to "IP Address" vertices. We can traverse across "IP Address" vertices to get to other linked "transient IDs" that might be linked to a known user. Let's do that in the following graph query. Run the following cell, and then click on the Graph tab to see an output displaying the path from the anonymous user to the known user:

In [None]:
%%gremlin 

g.V('ed4982b00e323383583f30236e5b1f11').
 out('uses'). // traverse the 'uses' edge to get to IP addresses
 in('uses'). // go from the IP address vertex to other associated transient IDs
 in('has_identity'). // go from the found transient IDs to known users (persistent IDs)
 dedup(). // remove duplicate persistent IDs
 path() // show a path from the unknown anonymous user to a known user

As you can see in the output above, we have found one known user linked to this anonymous user via a shared public IP address. We can take this a step further by looking at other known users in the same household, in order to determine common attributes that we may want to use when targeting an offer or ad to this anonymous user. Let's look at the household context for this known user by building on the query above. Again, click on the Graph tab after the query is executed.

In [None]:
%%gremlin

g.V('ed4982b00e323383583f30236e5b1f11').
 out('uses'). // traverse the 'uses' edge to get to IP addresses
 in('uses'). // go from the IP address vertex to other associated transient IDs
 in('has_identity'). // go from the found transient IDs to known users (persistent IDs)
 dedup(). // remove duplicate persistent IDs
 in('member').as('household'). // find the household associated with the found known user
 out('member'). // traverse back to the known users in this household
 out('has_identity'). // find the anonymous users (transient IDs) associated with those known users
 out('visited','uses'). // look at the websites browsed (and IP addresses used) by those anonymous users
 path().from('household') // display the graph from household to associated clickstream events

The graph above shows an entire "household subgraph" containing all known users in a household and the associated web activity. This contains useful information regarding what product pages or website subpages a given household is browsing and can be used to target offers to individuals in this household.
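The logic of the two traversals in this section can be mimicked on a tiny in-memory graph, which may help if you are new to Gremlin's `out()` and `in()` steps. This is only an illustrative sketch: the vertex names, the edge list, and the helpers `out`, `in_`, and `dedup` are invented for this example and are not the Neptune or Gremlin API. The edge directions follow the queries above.

```python
# Toy edge list: (source, label, target). All IDs are made up.
EDGES = [
    ("household-1", "member", "person-A"),
    ("household-1", "member", "person-B"),
    ("person-A", "has_identity", "device-1"),
    ("person-B", "has_identity", "device-2"),
    ("device-0", "uses", "ip-1"),  # anonymous device with no persistent ID
    ("device-1", "uses", "ip-1"),
    ("device-2", "uses", "ip-2"),
    ("device-1", "visited", "site/page-1"),
    ("device-2", "visited", "site/page-2"),
]


def out(vertices, *labels):
    """Follow outgoing edges with one of the given labels."""
    return [t for v in vertices for (s, l, t) in EDGES if s == v and l in labels]


def in_(vertices, *labels):
    """Follow incoming edges with one of the given labels."""
    return [s for v in vertices for (s, l, t) in EDGES if t == v and l in labels]


def dedup(vertices):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(vertices))


def known_users_sharing_ip(transient_id):
    """Mirror of out('uses').in('uses').in('has_identity').dedup()."""
    ips = out([transient_id], "uses")
    devices = in_(ips, "uses")
    return dedup(in_(devices, "has_identity"))


def household_activity(transient_id):
    """Mirror of the household query: websites and IPs seen across the household."""
    households = dedup(in_(known_users_sharing_ip(transient_id), "member"))
    members = out(households, "member")
    devices = out(members, "has_identity")
    return sorted(out(devices, "visited", "uses"))


print(known_users_sharing_ip("device-0"))
print(household_activity("device-0"))
```

Here the anonymous `device-0` reaches `person-A` through the shared `ip-1`, and the household step then pulls in the activity of `person-B` as well, even though `person-B` never shared an IP address with the anonymous device.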

### Targeted Promotions

**Ecommerce publishers want to convince undecided users to purchase a product by offering them discount codes as soon as they meet certain criteria. Find all users who have visited a product page at least X times in the last 30 days but did not buy anything (that is, have not visited the thank you page).**

Another way to look at user behavior is to determine which users are interested in a specific product but have not converted into a purchase. This can be captured in clickstream data by looking at clickstream events related to a product page and an associated "thank you" page (or an event showing a conversion to a purchase). In the following scenario, we assume a product page with the URL 'c94174b63350fd53' (remember, these are hashed values; this could be http://www.amazon.com) and a thank you page with the URL 'c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f'. Given this thank you page, let's traverse our graph to determine which users have looked at our website 'c94174b63350fd53' but have not made a purchase (that is, have not browsed to 'c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f').

Let's first look at the users who have made a purchase. Run the following query and click on the Graph tab to view the results.

In [None]:
%%gremlin

g.V().has('url','c94174b63350fd53'). //find product page
 out('links_to').has('url','c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f'). //find thank you page
 in('visited'). //traverse to users who have browsed the thank you page
 in('has_identity'). //traverse to the associated known users (persistent IDs)
 path()

Now, let's look at the converse and find all users that have browsed our site but have not converted to a sale (or viewed the thank you page). We can run the following query looking for users that have landed on our product page, viewed other pages on the site, and yet have not viewed the thank you page. Run the following query and view the Graph tab to see the results.

In [None]:
%%gremlin

g.V().has('url','c94174b63350fd53'). //find product page
 out('links_to'). //find subpages other than the thank you page
 where(not(has('url','c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f'))).
 in('visited'). //get the transient IDs that viewed those subpages
 where(not(out('visited').has('url','c94174b63350fd53/1e8deebfc8e36e85/b5509c3fb28c4e4f'))). //keep only those that never viewed the thank you page
 in('has_identity').dedup(). //fetch associated unique known users
 path()

The graph above shows all associated persistent IDs (known users), transient IDs (associated anonymous users), and the other pages they have viewed on our site. We can use this data to understand user behavior and to make offers or discounts that entice a product conversion.
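The queries above do not apply the "at least X times in the last 30 days" part of the use case statement. As a hedged illustration of that rule, the sketch below filters a plain list of visits in Python. The `(user_id, url, timestamp)` layout, the function name `undecided_users`, and the sample URLs are assumptions for this example; in particular, this notebook does not state which property holds a visit timestamp. The sketch also assumes that a purchase only disqualifies a user if it happened inside the same window.

```python
from collections import Counter
from datetime import datetime, timedelta


def undecided_users(visits, product_url, thank_you_url, min_visits, now, days=30):
    """Users with >= min_visits product views in the window and no thank-you view.

    visits: iterable of (user_id, url, timestamp) tuples (hypothetical layout).
    """
    cutoff = now - timedelta(days=days)
    # Keep only visits inside the look-back window.
    recent = [(user, url) for user, url, ts in visits if ts >= cutoff]
    product_views = Counter(user for user, url in recent if url == product_url)
    buyers = {user for user, url in recent if url == thank_you_url}
    return sorted(
        user for user, views in product_views.items()
        if views >= min_visits and user not in buyers
    )


PRODUCT = "shop/product"        # stand-in for the hashed product page URL
THANK_YOU = "shop/thank-you"    # stand-in for the hashed thank you page URL
now = datetime(2024, 3, 1)
visits = [
    ("u1", PRODUCT, datetime(2024, 2, 10)),
    ("u1", PRODUCT, datetime(2024, 2, 20)),
    ("u2", PRODUCT, datetime(2024, 2, 15)),
    ("u2", PRODUCT, datetime(2024, 2, 16)),
    ("u2", THANK_YOU, datetime(2024, 2, 16)),  # u2 converted
    ("u3", PRODUCT, datetime(2023, 12, 1)),    # outside the 30-day window
    ("u3", PRODUCT, datetime(2024, 2, 25)),
]
print(undecided_users(visits, PRODUCT, THANK_YOU, min_visits=2, now=now))
```

Only `u1` qualifies here: `u2` reached the thank you page, and `u3` has a single visit inside the window.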

### Audience Segmentation

**Advertisers want to generate audiences for demand side platform (DSP) targeting. The specific audience could be the users who are interested in specific car brands.**

Using our dataset, let's assume a website group is associated with a given brand (it could be http://www.amazon.com). In this particular case, let's assume it is a brand of automobile. Given our brand, we may want to look at the entire associated audience and extract certain characteristics or demographics from that audience. We can do this by looking at the entire subgraph related to the brand. Let's use the website group 'c94174b63350fd53' to model this pattern. Starting with 'c94174b63350fd53', let's traverse the graph and extract all associated known users. First, let's see how many subpages the brand has and how large the audience visiting those subpages is. Run the following query to see these statistics.

In [None]:
%%gremlin

g.V().has('url','c94174b63350fd53').
 project('subpages','audience_size').
 by(out('links_to').count()).
 by(out('links_to').in('visited').count())

Next, let's take a look at the entire brand subgraph. The following query displays all transient and persistent IDs, website subpages, and brands (website groups) associated with the targeted brand. Run the query and open the Graph tab to see the results.

In [None]:
%%gremlin

g.V().has('url','c94174b63350fd53').out('links_to').in('visited').in('has_identity').dedup().path()


The goal of this query is to find all user devices (transient IDs) that interacted with any of the subpages on that domain, starting from a single website group vertex.

In other words, we are interested in all device or cookie identifiers that showed up on any of the domain's pages, as well as identifiers that did not explicitly interact with the brand but belong to the same user. This information is useful for retargeting.
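The difference between counting visits, counting unique devices, and expanding to every device owned by the same users can be shown on toy data. As written, the `audience_size` projection above applies `count()` without `dedup()`, so a transient ID that visited two subpages contributes two to the total. The sketch below is a plain-Python illustration with invented IDs and a hypothetical helper `brand_audience`; it does not query Neptune.

```python
# Toy data, all IDs invented for illustration.
LINKS_TO = {"brand": ["brand/a", "brand/b"]}           # website group -> subpages
VISITED_BY = {"brand/a": ["t1", "t2"],                  # subpage -> transient IDs
              "brand/b": ["t2", "t3"]}
OWNER = {"t1": "p1", "t2": "p1", "t3": "p2", "t4": "p2"}  # transient -> persistent ID


def brand_audience(group):
    """Summarize the audience of a website group on the toy data above."""
    subpages = LINKS_TO.get(group, [])
    # One entry per visit, like out('links_to').in('visited') without dedup().
    visits = [t for page in subpages for t in VISITED_BY.get(page, [])]
    devices = set(visits)
    people = {OWNER[t] for t in devices if t in OWNER}
    # Expand to every device of those people, including ones that never visited.
    retargeting = sorted(t for t, owner in OWNER.items() if owner in people)
    return {
        "subpages": len(subpages),
        "visits": len(visits),
        "unique_devices": len(devices),
        "known_users": len(people),
        "retargeting_devices": retargeting,
    }


print(brand_audience("brand"))
```

In this example `t4` never visited the brand, but it belongs to `p2`, so it appears in the retargeting list. In Gremlin, the equivalent expansion would require one more step from the persistent IDs back out to their transient IDs.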


## Conclusion

Identity graphs are a key component in being able to analyze consumer behavior and provide deep insights on how best to interact with potential customers. In this notebook, we walked through three potential use cases for building an identity graph. Cross-device graphs discuss how to unify user IDs across multiple devices and how to build household subgraphs. Targeted promotions showcase an example of isolating a subgraph of undecided consumers for targeted discounts. Lastly, audience segmentation discusses ways to identify a set of unique users based on all interactions with the brand through its websites. The patterns displayed above are just a small subset of how identity graphs can be leveraged.

## What's Next?

To build an identity graph solution that incorporates Neptune, we recommend the following resources:
 
- [Getting Started with Amazon Neptune](https://pages.awscloud.com/AWS-Learning-Path-Getting-Started-with-Amazon-Neptune_2020_LP_0009-DAT.html) is a video-based learning path that shows you how to create and connect to a Neptune database, choose a data model and query language, author and tune graph queries, and integrate Neptune with other Amazon Web Services.
- Before you begin designing your database, consult the [Amazon Web Services Reference Architectures for Using Graph Databases](https://github.com/aws-samples/aws-dbs-refarch-graph/) GitHub repo, where you can browse examples of reference deployment architectures, and learn more about building a graph data model and choosing a query language.
- For links to documentation, blog posts, videos, and code repositories with samples and tools, see the [Amazon Neptune developer resources](https://aws.amazon.com/neptune/developer-resources/).
- Neptune ML makes it possible to build and train useful machine learning models on large graphs in hours instead of weeks. To find out how to set up and use a graph neural network, see [Using Amazon Neptune ML for machine learning on graphs](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning.html).
- [Identity Graphs on Amazon Web Services](https://aws.amazon.com/advertising-marketing/identity-graph/) showcases Amazon Web Services solutions specifically designed for identity graphs, focusing on advertising and marketing.
- To see how Cox Automotive scales digital personalization using an identity graph powered by Amazon Neptune, see this [blog post](https://aws.amazon.com/blogs/database/cox-automotive-scales-digital-personalization-using-an-identity-graph-powered-by-amazon-neptune/) and [presentation](https://youtu.be/I7_b1xkQ7Dc).