# Identity Graph Sample
### Made Available by: AWS Adtech/Martech Team

## About this sample solution

An identity graph provides a single unified view of customers and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods. Typically, identity graphs are part of a larger identity resolution architecture. Identity resolution is the process of matching human identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes, for targeted advertising. 

The following notebook walks you through a sample identity graph solution and shows how it can be used within a larger identity resolution architecture, using an open dataset and a graph database, Amazon Neptune. The notebook also includes a number of data visualizations that help you understand the structure of an identity graph and the characteristics of an identity resolution dataset and use case. Later in the notebook, we present some additional use cases that this particular dataset supports.


## Glossary

* **transient identity** - cookie or user device that interacted with the website
* **persistent identity** - single user; one user might have many transient identifiers, as they might own many devices or use many cookies
* **group identity** (identityGroup) - group of persistent identities forming households or companies
* **group of websites** (websiteGroup) - group of websites with the same domain
* **session** - group of single user events on one domain. A user can have many sessions, and each session has at least one event. Events a, b, and c belong to the same session if the following conditions hold: 
 * each event is initiated by the same transientId
 * each event is on the same domain
 * given timestamp(a) < timestamp(b) < timestamp(c) and a maximum delta X between two consecutive events:
 * timestamp(b) - timestamp(a) < X and 
 * timestamp(c) - timestamp(b) < X
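
The session definition can be illustrated with a minimal Python sketch. It reads X as the maximum gap allowed between two consecutive events, and it assumes a hypothetical event shape of `(transient_id, domain, timestamp_in_seconds)`; the function name `split_into_sessions` and the sample values are illustrative only and are not part of the `nepytune` package.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical event shape: (transient_id, domain, timestamp_in_seconds)
Event = Tuple[str, str, int]


def split_into_sessions(events: List[Event], max_delta: int) -> List[List[Event]]:
    """Group events into sessions per (transient_id, domain) pair.

    A new session starts whenever the gap between two consecutive
    events of the same pair is not smaller than max_delta.
    """
    by_key: Dict[Tuple[str, str], List[Event]] = defaultdict(list)
    for event in events:
        by_key[(event[0], event[1])].append(event)

    sessions: List[List[Event]] = []
    for key in sorted(by_key):
        ordered = sorted(by_key[key], key=lambda e: e[2])
        current = [ordered[0]]
        for previous, following in zip(ordered, ordered[1:]):
            if following[2] - previous[2] < max_delta:
                current.append(following)
            else:
                sessions.append(current)
                current = [following]
        sessions.append(current)
    return sessions


# Made-up sample data
events = [
    ("t1", "example.com", 0),
    ("t1", "example.com", 100),
    ("t1", "example.com", 2000),
    ("t1", "other.org", 50),
    ("t2", "example.com", 10),
]
sessions = split_into_sessions(events, max_delta=1800)
print(len(sessions))
```

With a 1800-second limit, the first two `t1` events on `example.com` share a session, while the third starts a new one because it arrives 1900 seconds after its predecessor.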

## Dataset

The dataset used for the purposes of this demo comes from **CIKM Cup 2016 Track 1: Cross-Device Entity Linking Challenge**, https://competitions.codalab.org/competitions/11171.

The dataset contains an anonymized browse log for a set of anonymized userIDs representing the same user across multiple devices, as well as obfuscated site URLs and HTML titles those users visited. The dataset contains few user attributes, so they had to be generated artificially.

### Dataset download

The dataset has already been loaded into an Amazon Neptune cluster for you to use with this notebook. However, should you choose to inspect the raw data itself, the CSV files can be found at the links below:
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/identity_group_edges.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/identity_group_nodes.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/ip_edges.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/ip_nodes.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/persistent_edges.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/persistent_nodes.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/transient_edges.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/transient_nodes.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/website_group_edges.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/website_group_nodes.csv
* https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/websites.csv
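
If you want to peek at one of these files without downloading it in full, a small standard-library sketch along the following lines may help. The helper names `preview_rows` and `preview_remote` are hypothetical, and the columns in the offline sample are made up; the real files may use different headers.

```python
import csv
import io
import urllib.request
from typing import Dict, List, TextIO

BASE_URL = "https://aws-admartech-samples.s3.amazonaws.com/identity-resolution/data/"


def preview_rows(handle: TextIO, limit: int = 5) -> List[Dict[str, str]]:
    """Return up to `limit` rows of CSV text as dictionaries keyed by header."""
    rows: List[Dict[str, str]] = []
    for row in csv.DictReader(handle):
        if len(rows) >= limit:
            break
        rows.append(dict(row))
    return rows


def preview_remote(file_name: str, limit: int = 5) -> List[Dict[str, str]]:
    """Stream one of the listed files and preview its first rows (needs network access)."""
    with urllib.request.urlopen(BASE_URL + file_name) as response:
        text = io.TextIOWrapper(response, encoding="utf-8", newline="")
        return preview_rows(text, limit)


# Offline demonstration with made-up columns; real headers may differ.
sample = io.StringIO("id,label\nw1,website\nw2,website\nw3,website\n")
sample_rows = preview_rows(sample, limit=2)
print(sample_rows)
```

Calling `preview_remote("websites.csv")` would apply the same preview to the last file in the list, assuming the bucket is reachable from your environment.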
 


## Graph model
Diagrams below show the dataset represented as a graph before and after manipulations.

Before:
![Untitled%20%281%29.svg](attachment:Untitled%20%281%29.svg)

After:
![extended.svg](attachment:extended.svg)

In short, the extended graph adds:
* more transient node attributes (type, user agents and its derivatives, email addresses) 
* IP & Location nodes (IP address, state, city) 
* identity group nodes (with type attribute)
* website groups (aka root domains with information about the content on the page, with IAB categories)
* more context to visiting page events

## Graph Statistics

The following commands report the number of vertices and edges by label.

In [None]:
%%bash

pip --disable-pip-version-check install colorlover tqdm intervaltree sortedcontainers scipy --no-deps

In [None]:
from IPython import get_ipython
import json
ipython = get_ipython()
result = ipython.magic("graph_notebook_config")

In [None]:
import os

from nepytune.traversal import get_traversal

g = get_traversal(f"wss://{result.host}:8182/gremlin")

**Note: The following is going to do a count of all vertices and edges in the graph. This can take 2-3 minutes to return results.**

In [None]:
%%gremlin --store-to counts_of_vertices --silent

g.V().groupCount().by(T.label).toList()

In [None]:
%%gremlin --store-to counts_of_edges --silent

g.E().groupCount().by(T.label).toList()
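
To make the result of `groupCount().by(T.label)` more concrete, here is a plain-Python sketch of the same aggregation over a toy list. The `(id, label)` pairs are made up, and the list-with-one-map shape is an assumption based on how the notebook later reads `counts_of_vertices[0]`.

```python
from collections import Counter

# Made-up (vertex_id, label) pairs standing in for g.V()
toy_vertices = [
    ("t1", "transientId"),
    ("t2", "transientId"),
    ("p1", "persistentId"),
    ("w1", "website"),
]

# Count how many vertices carry each label
label_counts = Counter(label for _, label in toy_vertices)

# Assumed shape of the stored result: a list holding a single label -> count map
stored_result = [dict(label_counts)]
print(stored_result[0])
```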

Execute the following cell to visualize the counts of both vertices and edges:

In [None]:
from nepytune.visualizations.bar_plots import make_bars

chart1 = make_bars(counts_of_vertices[0], 
 "Number of Vertices per label",
 y_title="Number of vertices",
 x_title="Vertices per label",
 lazy=True)
chart1.show()

chart2 = make_bars(counts_of_edges[0], 
 "Number of Edges per label",
 y_title="Number of edges",
 x_title="Edges per label",
 lazy=True)
chart2.show()

In [None]:
import pprint

from nepytune.usecase import (
 user_summary, 
 undecided_users, 
 brand_interaction, 
 users_from_household, 
 purchase_path,
 similar_audience
)

pp = pprint.PrettyPrinter(indent=4)

## Use case queries
Each of the use case queries follows the same presentation pattern:
* define the query parameters
* show the traversal code
* draw part of referenced subgraph for the visual introspection of the query results
* describe the plot

The referenced subgraph is built using the **networkx** package and rendered using the **plotly** library. We mostly use networkx for computing positions, except for Venn diagram visualisations, where we compute the positions ourselves. You can find all the code for the use-case queries and subgraph generation in the `nepytune/usecase` package.

The subgraph query is distinct from the use case query as it often extracts more information for the purposes of the visualisation, and as a consequence is often less efficient. 

Some of the use-case query visualisations are explained to guide the reader and help with their understanding. 
The purpose is just to present the complexity of the dataset and to introspect the query results. Each plot is interactive, so feel free to play with it. 
 

## Use case 1) 
### Advertisers want to find out information about user interests to provide accurate targeting. The data should be based on the activity of the user across all devices.

In [None]:
TRANSIENT_ID = "8ff869623f18f72f7cd073120dd905ec"

In [None]:
%%gremlin --store-to sibling_attrs --silent

g.V('${TRANSIENT_ID}')
 .choose(
 in("has_identity"), // check if this transient id has persistent id
 in("has_identity").
 project(
 "identity_group_id", "persistent_id", "attributes", "ip_location", "iab_categories"
 ).by(in("member").values("igid"))
 .by(values("pid"))
 .by(
 out("has_identity").valueMap().unfold()
 .group()
 .by(keys)
 .by(select(values).unfold().dedup().fold())
 )
 .by(
 out("has_identity")
 .out("uses").dedup().valueMap().fold()
 )
 .by(
 out("has_identity")
 .out("visited")
 .in("links_to")
 .values("categoryCode").dedup().fold()
 )
 , project(
 "identity_group_id", "persistent_id", "attributes", "ip_location", "iab_categories"
 ).by(constant(""))
 .by(constant(""))
 .by(
 valueMap().unfold()
 .group()
 .by(keys)
 .by(select(values).unfold().dedup().fold())
 )
 .by(
 out("uses").dedup().valueMap().fold()
 )
 .by(
 out("visited")
 .in("links_to")
 .values("categoryCode").dedup().fold()
 )
 )

In [None]:
pp.pprint(sibling_attrs)

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
user_summary.draw_refrenced_subgraph(g, TRANSIENT_ID)

#### Graph description

There are 6 node types on the graph:
* red one representing IP & Location node
* green one representing identity group node of type household
* violet one representing persistent identities (aka different users)
* orange one representing transient identities (aka various user devices)
* blue one representing website subpages
* pink one representing website root domains

The goal of this query is to find the summary of a given persistent identity profile based on one of its transient identities. The graph presented above shows part of the web activity of the household to which the given transient identity belongs.

## Use case 2) 
### Ecommerce publishers want to convince undecided users to purchase the product by offering them discount codes as soon as they have met certain criteria. Find all users who have visited the product page at least X times in the last 30 days, but did not buy anything (have not visited the thank you page).

There are actually two questions we can formulate for created audiences: 
* who belongs to the audience? 
* does this user belong to the audience?

Each question above has a separate implementation. The membership test can be answered more efficiently and, if its performance is good enough for the use case, might be used in a real-time application. 
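
The difference between the two questions can be sketched in plain Python over an in-memory list of visits. This is an illustration only: the `(persistent_id, url, timestamp)` record shape, the function names, and the sample URLs are assumptions, and the strict `>` comparison follows the `gt(...)` predicate used in this notebook's Gremlin queries rather than the "at least X" wording of the use case.

```python
import datetime
from typing import List, Set, Tuple

# Hypothetical visit record: (persistent_id, url, timestamp)
Visit = Tuple[str, str, datetime.datetime]


def is_undecided(visits: List[Visit], user: str, product_url: str,
                 thank_you_url: str, since: datetime.datetime,
                 min_visits: int) -> bool:
    """Membership test: does this single user belong to the audience?"""
    recent = [v for v in visits if v[0] == user and v[2] > since]
    product_visits = sum(1 for v in recent if v[1] == product_url)
    converted = any(v[1] == thank_you_url for v in recent)
    return product_visits > min_visits and not converted


def undecided_audience(visits: List[Visit], product_url: str,
                       thank_you_url: str, since: datetime.datetime,
                       min_visits: int) -> Set[str]:
    """Audience query: who belongs to the audience?"""
    users = {v[0] for v in visits}
    return {
        user for user in users
        if is_undecided(visits, user, product_url, thank_you_url, since, min_visits)
    }


# Made-up sample data
since = datetime.datetime(2016, 6, 7)
day = datetime.timedelta(days=1)
visits = [
    ("p1", "shop/product", since + day),
    ("p1", "shop/product", since + 2 * day),
    ("p1", "shop/product", since + 3 * day),
    ("p1", "shop/product", since - day),           # too old, ignored
    ("p2", "shop/product", since + day),
    ("p2", "shop/product", since + 2 * day),
    ("p2", "shop/product", since + 3 * day),
    ("p2", "shop/product/thanks", since + 4 * day),  # p2 converted
    ("p3", "shop/product", since + day),
]
audience = undecided_audience(visits, "shop/product", "shop/product/thanks",
                              since, min_visits=2)
print(audience)
```

The membership test only touches one user's visits, which is why it tends to be the cheaper of the two questions; the audience query has to evaluate every candidate user.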


In [None]:
import datetime

TRANSIENT_ID = "808d27e1fbe3016cf9523de320bfb1be"
WEBSITE_URL = "b23e286d713f61fd/f9077d4b41c9e32e"
THANK_YOU_PAGE_URL = "b23e286d713f61fd/f9077d4b41c9e32e/4b2b32d4f88fb014"
SINCE = datetime.datetime(2016, 6, 7)
SINCE_ISO = SINCE.isoformat()
MIN_VISITED_COUNT = 5

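The following two queries check whether a user is a member of an audience by looking at connections to a specific thank you page. The second query repeats the check with a thank you page URL that does not exist, so no conversion can be found for it: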

In [None]:
%%gremlin --store-to undecided_user_audience_check

g.V('${TRANSIENT_ID}')
 .hasLabel("transientId")
 .in("has_identity")
 .out("has_identity")
 .outE("visited")
 .has("ts", gt(datetime('${SINCE_ISO}')))
 .choose(
 has("visited_url",'${WEBSITE_URL}'),
 groupCount("visits").by(constant("page_visits"))
 )
 .choose(
 has("visited_url", '${THANK_YOU_PAGE_URL}'),
 groupCount("visits").by(constant("thank_you_page_vists"))
 )
 .cap("visits")
 .coalesce(
 and(
 coalesce(select("thank_you_page_vists"), constant(0)).is(0),
 select("page_visits").is(gt(${MIN_VISITED_COUNT}))
 ).choose(
 count().is(1),
 constant(true)
 ),
 constant(false)
 )

In [None]:
%%gremlin --store-to undecided_user_audience_check

g.V('${TRANSIENT_ID}')
 .hasLabel("transientId")
 .in("has_identity")
 .out("has_identity")
 .outE("visited")
 .has("ts", gt(datetime('${SINCE_ISO}')))
 .choose(
 has("visited_url",'${WEBSITE_URL}'),
 groupCount("visits").by(constant("page_visits"))
 )
 .choose(
 has("visited_url", '${THANK_YOU_PAGE_URL}-this-page-not-exist'),
 groupCount("visits").by(constant("thank_you_page_vists"))
 )
 .cap("visits")
 .coalesce(
 and(
 coalesce(select("thank_you_page_vists"), constant(0)).is(0),
 select("page_visits").is(gt(${MIN_VISITED_COUNT}))
 ).choose(
 count().is(1),
 constant(true)
 ),
 constant(false)
 )

The following query finds all users in the audience: those who visited the product page more than `MIN_VISITED_COUNT` times since the given date but did not land on the product conversion / "thank you" page:

In [None]:
%%gremlin --store-to users

g.V('${WEBSITE_URL}')
 .hasLabel("website")
 .inE("visited").has("ts", gt(datetime('${SINCE_ISO}'))).outV()
 .in("has_identity")
 .groupCount()
 .unfold().dedup()
 .where(
 select(values).is(gt(${MIN_VISITED_COUNT}))
 )
 .select(keys).as("pids")
 .map(
 out("has_identity")
 .outE("visited")
 .has("visited_url", '${THANK_YOU_PAGE_URL}')
 .has("ts", gt(datetime('${SINCE_ISO}'))).outV()
 .in("has_identity").dedup()
 .values("pid").fold()
 ).as("pids_that_visited")
 .select("pids")
 .not(
 has("pid", where(within("pids_that_visited")))
 )
 .out("has_identity")
 .values("uid")

Run the following cell to visualize the audience:

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
undecided_users.draw_referenced_subgraph(g, WEBSITE_URL, THANK_YOU_PAGE_URL, SINCE, MIN_VISITED_COUNT)

#### Graph description

There are 4 node types on the graph:
* red one representing persistent identities (aka different users)
* violet one representing transient identities (aka various user devices) 
* green one representing website thank you page (page that is visited when user converts) 
* orange one representing website product page 

The goal of this query is to find all the user devices (violet nodes) that belong to an audience. A user (red node) belongs to the audience if they visited the product page (orange node) but did not buy the product (did not reach the green node) on any of their devices (violet nodes), all within the provided date interval (the last X days).

The opaque user nodes (red ones) and user device nodes (violet ones) form the desired audience. 


## Use case 3) 
### Advertisers want to generate audiences for DSP platform targeting. A specific audience could be the users who are interested in specific car brands.

Given website url, get its root node (root domain).

In [None]:
%%gremlin --store-to root_url

g.V('${WEBSITE_URL}')
 .hasLabel("website")
 .in("links_to")

Given a website url, get all identities, including those linked transitively through persistent identities, that interacted with this brand on any of its pages.

In [None]:
%%gremlin --store-to brand_interaction_audience

g.V('${WEBSITE_URL}')
 .hasLabel("website")
 .in("links_to")
 .out("links_to") // get all websites from this root url
 .in("visited")
 .in("has_identity").dedup()
 .out("has_identity")
 .values("uid")

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
WEBSITE_URL = "b23e286d713f61fd"
print(f"Number of subpages: {g.V(WEBSITE_URL).in_('links_to').out('links_to').count().next()}")
users = brand_interaction.brand_interaction_audience(g, WEBSITE_URL).toList()
print(f"Audience size: {len(users)}")

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
brand_interaction.draw_referenced_subgraph(g, WEBSITE_URL)

#### Graph description

There are 4 node types on the graph:
* red one representing persistent identities (aka different users)
* green one representing transient identities (aka various user devices) 
* violet one representing website subpages 
* orange one representing website root domain

The goal of this query is, given one website subpage (violet node), to find all user devices (green nodes) that interacted with any of the subpages on that domain. 

In other words, we are interested in all user device or cookie identifiers that showed up on any page of the domain, including the ones that did not explicitly interact with the brand but belong to the same user. This might be useful for the purposes of retargeting.
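
The expansion from one subpage to a full device audience can be sketched with plain dictionaries. All names below (`brand_audience`, `root_of`, `visitors_of`, `owner_of`) and the sample data are hypothetical stand-ins for the graph edges, not the `nepytune` implementation.

```python
from typing import Dict, Set


def brand_audience(page: str,
                   root_of: Dict[str, str],
                   visitors_of: Dict[str, Set[str]],
                   owner_of: Dict[str, str]) -> Set[str]:
    """Return every device of every known user who visited any page of the brand."""
    root = root_of[page]
    # All subpages that share the root domain of the given page
    pages = [p for p, r in root_of.items() if r == root]
    # Devices that visited at least one of those subpages
    visiting_devices: Set[str] = set()
    for p in pages:
        visiting_devices |= visitors_of.get(p, set())
    # Users behind those devices; devices without a known user are dropped,
    # as happens when a traversal passes through persistent identities
    users = {owner_of[d] for d in visiting_devices if d in owner_of}
    # Every device owned by those users, whether or not it visited the brand
    return {d for d, user in owner_of.items() if user in users}


# Made-up sample data
root_of = {"brand/a": "brand", "brand/b": "brand", "other/x": "other"}
visitors_of = {"brand/a": {"d1"}, "brand/b": {"d3"}, "other/x": {"d4"}}
owner_of = {"d1": "p1", "d2": "p1", "d4": "p2"}  # d3 has no known user

devices = brand_audience("brand/a", root_of, visitors_of, owner_of)
print(sorted(devices))
```

Here `d2` never visited the brand, yet it lands in the audience because it belongs to the same user as `d1`.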

## Use case 4) 
### User has visited a travel agency website recently. Advertisers want to display ads about travel promotions to all members of his household.


Given transient id, get all transient ids from its household.

In [None]:
%%gremlin --store-to all_transient_ids_in_household

g.V('${TRANSIENT_ID}')
 .hasLabel("transientId")
 .in("has_identity")
 .in("member")
 .has("type", "household")
 .out("member")
 .out("has_identity")
 .values("uid")

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
users_from_household.draw_referenced_subgraph(g, TRANSIENT_ID)

#### Graph description

There are 3 node types on the graph:
* red one representing identity group node of type household
* green one representing persistent identities (aka various users belonging to this household) 
* violet one representing transient identities (aka various user devices or different cookie IDs)

The goal of this query is, given one transient identifier in the household (violet node), to find the other transient identifiers that belong to the same household.
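
As a rough illustration of the device, user, household, users, devices walk, consider the following dictionary-based sketch. The mappings `owner_of` and `household_of`, the function name, and the sample identifiers are all hypothetical.

```python
from typing import Dict, Set


def household_devices(device: str,
                      owner_of: Dict[str, str],
                      household_of: Dict[str, str]) -> Set[str]:
    """Return all devices owned by members of the given device's household."""
    user = owner_of.get(device)
    if user is None:
        return set()  # device not linked to any known user
    household = household_of.get(user)
    if household is None:
        return set()  # user not assigned to a household
    members = {u for u, h in household_of.items() if h == household}
    return {d for d, u in owner_of.items() if u in members}


# Made-up sample data
owner_of = {"d1": "p1", "d2": "p1", "d3": "p2", "d4": "p3"}
household_of = {"p1": "h1", "p2": "h1", "p3": "h2"}

same_household = household_devices("d1", owner_of, household_of)
print(sorted(same_household))
```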


## Use case 5)

### A marketing analyst wants to understand the path to purchase of a new product by a few early adopters (say 5) through interactive queries. This product is high involvement and expensive, and therefore they want to understand the research undertaken by the customer.

* Which device was used to initiate the first research? Was that prompted by an ad or an email promotion?
* How many devices were used overall, and how much time passed between initial research and final purchase?
* On which devices did the customer spend more time?

In [None]:
SKIP_SINGLE_TRANSIENTS='true'
LIMIT=5

In [None]:
%%gremlin --store-to early_adopters_activities

 g.V('${THANK_YOU_PAGE_URL}')
 .hasLabel("website").as("thank_you")
 .in("links_to").as("website_group")
 .select("thank_you")
 .inE("visited")
 .order().by("ts")
 .choose(
 constant(${SKIP_SINGLE_TRANSIENTS}).is(eq(true)),
 where(outV().in("has_identity")),
 identity()
 )
 .choose(
 outV().in("has_identity"),
 project(
 "type", "id", "purchase_ts"
 )
 .by(constant("persistent"))
 .by(outV().in("has_identity"))
 .by(values("ts")),
 project(
 "type", "id", "purchase_ts"
 )
 .by(constant("transient"))
 .by(outV())
 .by(values("ts"))
 ).dedup("id").limit(${LIMIT})
 .choose(
 select("type").is("persistent"),
 project(
 "persistent_id", "transient_id", "purchase_ts"
 ).by(select("id").values("pid"))
 .by(select("id").out("has_identity").fold())
 .by(select("purchase_ts")),
 project("persistent_id", "transient_id", "purchase_ts")
 .by(constant(""))
 .by(select("id").fold())
 .by(select("purchase_ts"))
 ).project("persistent_id", "purchase_ts", "devices", "visits")
 .by(select("persistent_id"))
 .by(select("purchase_ts"))
 .by(select("transient_id").unfold().group().by(values("uid")).by(values("type")))
 .by(
 select("transient_id").unfold().outE("visited").order().by("ts")
 .where(
 inV().in("links_to").where(eq("website_group"))
 )
 .project(
 "transientId", "url", "ts"
 ).by("uid").by("visited_url").by("ts").fold())

In [None]:
pp.pprint(early_adopters_activities)

The following two cells use Python to parse the content into a format suitable for the subsequent visualizations. These cells may take a couple of minutes to execute:

In [None]:
activities = list(purchase_path.transform_activities(early_adopters_activities))

paths = list(
 purchase_path.compute_subgraph_pos(activities, THANK_YOU_PAGE_URL)
)

In [None]:
purchase_path.draw_referenced_subgraph(*paths[0])
purchase_path.draw_referenced_subgraph(*paths[1])

#### Graph description

There are 6 node types on each of the graphs:
* purple ones representing user events
* red ones representing persistent identities (aka various users) 
* deep red ones representing transient identities (aka various user devices or different cookie IDs)
* blue ones representing user sessions
* orange ones representing the thank you page
* green ones representing visited webpages with dropped query strings

The goal of this query is to display a single user's path to purchasing the product and to visually represent information such as: 
* what webpages were visited
* how many devices were used
* how many sessions took place

In the graph above, there is one user who used a single device both to do the research and to make the purchase. The user had 3 sessions and visited 6 subpages in total.

It's important to emphasise that in this dataset **it is not possible** to determine which domains represent actual products. We selected the domain and thank you page that seemed reasonable in terms of the number of events and visited webpages.

In [None]:
purchase_path.draw_referenced_subgraph(*paths[2])
purchase_path.draw_referenced_subgraph(*paths[3])
purchase_path.draw_referenced_subgraph(*paths[4])

### Other descriptive graph statistics

Charts below display:
* time to purchase the product for each user
* session statistics for all users, where you can find out about:
 * the total time spent across all devices and sessions
 * duration of each session
* most commonly visited subpages before purchase, split per single user (persistentId)
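
The first two statistics can be sketched in plain Python as follows. This is a simplified illustration, not what `purchase_path.generate_stats` does internally: the `(device_id, timestamp)` visit shape, the function names, the 30-minute session gap, and the sample data are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical visit record: (device_id, timestamp)
Visit = Tuple[str, datetime]


def time_to_purchase(visits: List[Visit], purchase_ts: datetime) -> timedelta:
    """Time between the earliest recorded visit and the purchase."""
    return purchase_ts - min(ts for _, ts in visits)


def session_durations(visits: List[Visit], max_gap: timedelta) -> Dict[str, List[timedelta]]:
    """Per device, the duration of each session (a gap >= max_gap starts a new one)."""
    per_device: Dict[str, List[datetime]] = defaultdict(list)
    for device, ts in visits:
        per_device[device].append(ts)

    durations: Dict[str, List[timedelta]] = {}
    for device, stamps in per_device.items():
        stamps.sort()
        start = previous = stamps[0]
        spans: List[timedelta] = []
        for ts in stamps[1:]:
            if ts - previous >= max_gap:
                spans.append(previous - start)  # close the running session
                start = ts
            previous = ts
        spans.append(previous - start)  # close the last session
        durations[device] = spans
    return durations


# Made-up sample data
visits = [
    ("phone", datetime(2016, 6, 1, 10, 0)),
    ("phone", datetime(2016, 6, 1, 10, 10)),
    ("phone", datetime(2016, 6, 1, 15, 0)),
    ("laptop", datetime(2016, 6, 2, 9, 0)),
    ("laptop", datetime(2016, 6, 2, 9, 20)),
]
purchase_ts = datetime(2016, 6, 2, 9, 20)

elapsed = time_to_purchase(visits, purchase_ts)
durations = session_durations(visits, max_gap=timedelta(minutes=30))
total_time = sum((d for spans in durations.values() for d in spans), timedelta())
print(elapsed, total_time)
```

A session with a single event has a duration of zero in this sketch, so total time is a lower bound on the time actually spent.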

In [None]:
stats = purchase_path.generate_stats(activities)
plots = purchase_path.custom_plots(stats)
for plot in plots:
 plot.show()

## Use case 6)

### Identify look-alike customers for a product. 

* The goal here is to identify prospects who show behavioral patterns similar to those of your existing customers. While we can easily do this algorithmically and automate it, the goal here is to provide a visual query that improves the marketing analysts' understanding. 

* Which device IDs in my customer graph are not yet buying my product (say, a golf club), but show similar behavior patterns, such as lifestyle choices of buying golf or other sporting goods?


In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
pp.pprint(similar_audience.recommend_similar_audience(g, THANK_YOU_PAGE_URL).toList())

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
similar_audience.draw_average_buyer_profile_pie_chart(g, THANK_YOU_PAGE_URL)

In [None]:
g = get_traversal(f"wss://{result.host}:8182/gremlin")
similar_audience.draw_referenced_subgraph(g, THANK_YOU_PAGE_URL)

#### Graph description

There are 4 node types:
* purple ones representing IAB categories
* blue ones representing persistent identities (aka various users) 
* red ones representing transient identities (aka various user devices or different cookie IDs)
* orange ones representing average buyer

The goal of this query is to display a similar audience created from an "average buyer".
The average buyer profile is created as the mean of the most popular categories across people who bought the product. In other words, we measure user interests based on their activity and then compute an averaged profile composed of the 3 best categories, which can be used for further reference. 

A user belongs to the audience if any of their interests is at least as big as in the referenced averaged profile. 

The opaque nodes represent users or user devices who either are not interested in this type of content, or are not interested enough.
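
The averaged profile and the membership rule can be sketched in a few lines of Python. This is one possible reading of the description, not the code behind `similar_audience.recommend_similar_audience`: profiles are assumed to map a category code to the share of a user's activity in that category, and the category codes and shares below are made up.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical profile: category code -> share of the user's activity
Profile = Dict[str, float]


def average_buyer_profile(buyers: List[Profile], top_n: int = 3) -> Profile:
    """Mean share per category across buyers, reduced to the top_n categories."""
    totals: Dict[str, float] = defaultdict(float)
    for profile in buyers:
        for category, share in profile.items():
            totals[category] += share
    means = {category: total / len(buyers) for category, total in totals.items()}
    # Highest mean first; ties are broken alphabetically for determinism
    top = sorted(means, key=lambda c: (-means[c], c))[:top_n]
    return {category: means[category] for category in top}


def is_lookalike(user: Profile, reference: Profile) -> bool:
    """True if any interest is at least as large as in the reference profile."""
    return any(user.get(category, 0.0) >= threshold
               for category, threshold in reference.items())


# Made-up sample data
buyers = [
    {"IAB17": 0.5, "IAB9": 0.25, "IAB1": 0.125, "IAB3": 0.125},
    {"IAB17": 0.5, "IAB9": 0.5},
]
reference = average_buyer_profile(buyers)

prospect_a = {"IAB9": 0.4, "IAB2": 0.6}
prospect_b = {"IAB17": 0.25, "IAB2": 0.75}
print(reference, is_lookalike(prospect_a, reference), is_lookalike(prospect_b, reference))
```

Because a single matching category is enough, this rule is permissive; requiring all reference categories to match would be a stricter alternative.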

## Visualisations 

## I want to know the most popular hours of activity for users that belong to a given segment


In [None]:
from nepytune.visualizations import histogram, segments

WEBSITE_ID = "a997482113271d8f/5758f309e11931ce"

g = get_traversal(f"wss://{result.host}:8182/gremlin")
histogram.show(
 segments.get_all_devices_from_website_visitors(g, WEBSITE_ID).outE("visited").values("ts").toList(),
 website_name=WEBSITE_ID
)
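
The aggregation behind an hours-of-activity histogram can be sketched as follows. The sketch assumes the `ts` values arrive as Python `datetime` objects and uses a hypothetical helper name, `visits_per_hour`; it does not reproduce the plotting done by `histogram.show`.

```python
from collections import Counter
from datetime import datetime
from typing import List


def visits_per_hour(timestamps: List[datetime]) -> List[int]:
    """Return 24 counts, one per hour of the day (index 0 is midnight)."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


# Made-up sample data
timestamps = [
    datetime(2016, 6, 1, 16, 5),
    datetime(2016, 6, 1, 16, 45),
    datetime(2016, 6, 2, 17, 30),
    datetime(2016, 6, 3, 9, 0),
]
hourly = visits_per_hour(timestamps)
busiest_hour = max(range(24), key=lambda hour: hourly[hour])
print(busiest_hour, hourly[busiest_hour])
```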

## I want to know the most popular devices used by a specific audience segment

The segment is built from users who are active between 16:00 and 18:00. 

In [None]:
from nepytune.visualizations import pie_chart, sunburst_chart, commons

g = get_traversal(f"wss://{result.host}:8182/gremlin")

conditions = commons.get_timerange_condition(g, start_hour=16, end_hour=18)

stats_activities = commons.get_user_device_statistics(g, conditions, limit=10000).next()

pie_chart.show(stats_activities)
sunburst_chart.show(stats_activities)

## I want to know the common part of specific segments

In [None]:
from nepytune.visualizations import segments, commons, venn_diagram

g = get_traversal(f"wss://{result.host}:8182/gremlin")
users_interested_in_content = segments.query_users_intersted_in_content(
 g, ['IAB11-3'], limit=3000
).toList()

g = get_traversal(f"wss://{result.host}:8182/gremlin")
users_active_between_16_18 = segments.query_users_active_in_given_date_intervals(
 g, commons.get_timerange_condition(g, start_hour=16, end_hour=18), limit=1000
).toList()

g = get_traversal(f"wss://{result.host}:8182/gremlin")
users_active_in_last_30_days = segments.query_users_active_in_n_days(g, n=30, limit=1000).toList()

venn_diagram.show_venn_diagram(
 venn_diagram.get_intersections(
 users_interested_in_content,
 users_active_between_16_18, 
 users_active_in_last_30_days
 ), labels=[
 "Users interested in specific content type",
 "Users active in the evenings",
 "Users who have visited website in past 30 days"
 ]
)
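
The seven regions of a three-set Venn diagram follow directly from set operations. The sketch below uses a hypothetical helper, `venn_region_sizes`, and made-up user IDs; it is not the implementation of `venn_diagram.get_intersections`.

```python
from typing import Dict, Set


def venn_region_sizes(a: Set[str], b: Set[str], c: Set[str]) -> Dict[str, int]:
    """Size of each of the seven disjoint regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Made-up sample segments
interested = {"u1", "u2", "u3", "u4"}
active_16_18 = {"u2", "u3", "u5"}
active_30_days = {"u3", "u4", "u5", "u6"}

regions = venn_region_sizes(interested, active_16_18, active_30_days)
print(regions)
```

Since the regions are disjoint, their sizes add up to the size of the union of the three segments, which makes a handy sanity check.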

## Users that visited website on more than one device

In [None]:
from nepytune.visualizations import network_graph

WEBSITE_ID = "a997482113271d8f/5758f309e11931ce"
g = get_traversal(f"wss://{result.host}:8182/gremlin")
network_graph.show(g, WEBSITE_ID)
 
