{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "# Social Network Recommendations\n", "\n", "In this example we're going to build a powerful social network predictive capability with some simple Gremlin queries. The techniques intrdocued here can be used to build predictions in other domains outside of social.\n", "\n", "### People You May Know\n", "\n", "A common feature of many social network applications is the ability to recommend People-You-May-Know (or People-You-May-Want-To-Know) – sometimes abbreviated PYMK.\n", "\n", "Using Amazon Neptune, we can implement a PYMK capability using a well-understood phenomenon called *triadic closure*. Tridaic closure is the \n", "tendency for elements at a very local level in a graph to form stable triangles as the data changes over time. This behaviour can be observed in graphs in all kinds of different domains. It's the basis of many homophily-based recommendation systems – systems that exploit the fact that similarity breeds connections. In this example we're going to look at using triadic closure in the context of a social network.\n", "\n", "Let's imagine we have a social network whose members include Bill, Terry and Sarah. Terry is friends with both Bill and Sarah; that is, Terry and Sarah have a mutual friend in Bill. \n", "\n", "Because they have Bill in common, there's a good chance that Sarah and Bill either already know one other or may get to know one another in the near future. Just looking at the graph, we can see they have both the *means* and the *motive* to be friends. Hanging around with Bill provides the means for Sarah and Terry to meet. And because they trust Bill, they have the motive to trust people with whom Bill is friends, increasing the chance that if they do meet, they'll form a connection and close the triangle.\n", "\n", "In the context of a social network, we can use triadic closure to implement PYMK. When a particular user logs into the system, we can look up their vertex in the graph, and then traverse their friend-of-a-friend network, looking for opportunities to close triangles. The more paths that extend from our user, through their immediate friends, to someone to whom they are not currently connected, the greater the likelihood the user may either already know that person, or may benefit from getting to know them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup\n", "\n", "First, we'll reference a couple of helper libraries, _neptune.py_ and _visualisation.py_. Then we'll clear any existing data from our Neptune cluster, and create a refernce to `g`, a graph traversal source, which we then use in subsequent queries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run '../util/neptune.py'\n", "%run '../util/visualisation.py'\n", "\n", "neptune.clear()\n", "g = neptune.graphTraversal()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we know which Neptune cluster to access? The methods above use a couple of environment variables, NEPTUNE_CLUSTER_ENDPOINT and NEPTUNE_CLUSTER_PORT, which were initialised when we first created this notebook instance.\n", "\n", "If you want to refer to a different Neptune cluster, you can supply the endpoint and port to the `clear()` and `graphTraversal()` methods:\n", "\n", "```\n", "neptune.clear(neptune_endpoint=, neptune_port=)\n", "g = neptune.graphTraversal(neptune_endpoint=, neptune_port=)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Social Network\n", "\n", "Next, we'll create a small social network. Note that the script below comprises a single statement. All the vertices and edges here will be created in the context of a single transaction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g. \\\n", "addV('User').property('name','Bill').property('birthdate', '1988-03-22'). \\\n", "addV('User').property('name','Sarah').property('birthdate', '1992-05-03'). \\\n", "addV('User').property('name','Ben').property('birthdate', '1989-10-21'). \\\n", "addV('User').property('name','Lucy').property('birthdate', '1998-01-17'). \\\n", "addV('User').property('name','Colin').property('birthdate', '2001-08-14'). \\\n", "addV('User').property('name','Emily').property('birthdate', '1998-03-05'). \\\n", "addV('User').property('name','Gordon').property('birthdate', '2002-12-04'). \\\n", "addV('User').property('name','Kate').property('birthdate', '1995-02-12'). \\\n", "addV('User').property('name','Peter').property('birthdate', '2001-02-27'). \\\n", "addV('User').property('name','Terry').property('birthdate', '1989-10-02'). \\\n", "addV('User').property('name','Alistair').property('birthdate', '1992-06-30'). \\\n", "addV('User').property('name','Eve').property('birthdate', '2000-05-13'). \\\n", "addV('User').property('name','Gary').property('birthdate', '1998-09-20'). \\\n", "addV('User').property('name','Mary').property('birthdate', '1997-01-27'). \\\n", "addV('User').property('name','Charlie').property('birthdate', '1989-11-02'). \\\n", "addV('User').property('name','Sue').property('birthdate', '1994-03-08'). \\\n", "addV('User').property('name','Arnold').property('birthdate', '2002-07-23'). \\\n", "addV('User').property('name','Chloe').property('birthdate', '1988-11-04'). \\\n", "addV('User').property('name','Henry').property('birthdate', '1996-03-15'). \\\n", "addV('User').property('name','Josie').property('birthdate', '2003-08-21'). \\\n", "V().hasLabel('User').has('name','Sarah').as_('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Colin').as_('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Terry').as_('a').V().hasLabel('User').has('name','Bill').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Peter').as_('a').V().hasLabel('User').has('name','Colin').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Kate').as_('a').V().hasLabel('User').has('name','Ben').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Kate').as_('a').V().hasLabel('User').has('name','Lucy').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Eve').as_('a').V().hasLabel('User').has('name','Lucy').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Alistair').as_('a').V().hasLabel('User').has('name','Kate').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Gary').as_('a').V().hasLabel('User').has('name','Colin').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Gordon').as_('a').V().hasLabel('User').has('name','Emily').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Alistair').as_('a').V().hasLabel('User').has('name','Emily').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Terry').as_('a').V().hasLabel('User').has('name','Gordon').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Alistair').as_('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Gary').as_('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Mary').as_('a').V().hasLabel('User').has('name','Terry').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Henry').as_('a').V().hasLabel('User').has('name','Alistair').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Sue').as_('a').V().hasLabel('User').has('name','Eve').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Sue').as_('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Josie').as_('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Henry').as_('a').V().hasLabel('User').has('name','Charlie').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Henry').as_('a').V().hasLabel('User').has('name','Mary').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Mary').as_('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').property('strength',1). \\\n", "V().hasLabel('User').has('name','Henry').as_('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').property('strength',2). \\\n", "V().hasLabel('User').has('name','Chloe').as_('a').V().hasLabel('User').has('name','Gary').addE('FRIEND').to('a').property('strength',3). \\\n", "V().hasLabel('User').has('name','Henry').as_('a').V().hasLabel('User').has('name','Arnold').addE('FRIEND').to('a').property('strength',1). \\\n", "next()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what the network looks like:\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Recommendation\n", "\n", "Let's now create a PYMK recommendation for a specific user.\n", "\n", "In the query below, we're finding the vertex that represents our user. We're then traversing `FRIEND` relationships (we don't care about relationship direction, so we're using `both()`) to find that user's immediate friends. We're then traversing another hop into the graph, looking for friends of those friends who _are not currently connected to our user_ (i.e., we're looking for the unclosed triangles).\n", "\n", "We then count the paths to these candidate friends, and order the results based on the number of times we can reach a candidate via one of the user's immediate friends." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "user = 'Terry'\n", "\n", "recommendations = (g.V().hasLabel('User').has('name', user).as_('user'). \n", " both('FRIEND').aggregate('friends'). \n", " both('FRIEND').\n", " where(P.neq('user')).where(P.without('friends')). \n", " groupCount().by('name'). \n", " order(Scope.local).by(Column.values, Order.decr).\n", " next())\n", "\n", "for key in recommendations.keys():\n", " print(key + ': ' + str(recommendations[key]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualise the Results\n", "\n", "Using _matplotlib_ we can visualise the results of a query.\n", "\n", "The sample code below uses the same traversal patterns as before to identify PYMK candidates, but instead of creating a resultset based on the number of times we reach a candidate, we create two projections: a list of candidate friends, and a list of paths to those candidate friends. We then use these lists to show the local network surrounding our user, with the PYML candidates highlighted in green.\n", "\n", "_visualisation.py_ contains the visualisation code. You can use this code as-is, or adapt it to your own needs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = g.V().hasLabel('User').has('name', user).as_('user'). \\\n", " both('FRIEND').aggregate('friends'). \\\n", " both('FRIEND'). \\\n", " where(P.neq('user')).where(P.without('friends')).aggregate('not-friends').by('name'). \\\n", " path().by('name').aggregate('p'). \\\n", " project('not-friends', 'paths'). \\\n", " by(__.select('not-friends').unfold().dedup().fold()). \\\n", " by(__.select('p')).next()\n", "\n", "notFriends = results['not-friends']\n", "paths = results['paths']\n", "\n", "def formatter(label, colors):\n", " if label in notFriends:\n", " colors.append('#11cc77')\n", " else:\n", " colors.append('yellow')\n", "\n", "visualisation.plotPaths(paths, formatter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Friendship Strength to Improve Recommendations\n", "\n", "What if we wanted to base our recommendations only on resonably strong friendship bonds?\n", "\n", "If you look at the Gremlin we used to create our graph, you'll see that each `FRIEND` edge has a `strength` property. In the following query, the traversal applies a predicate to this `strength` property. Note that we use `bothE()` rather than `both()` to position the traversal on an edge, where we then apply the predicate. We proceed only where `strength` is greater than one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recommendations = (g.V().hasLabel('User').has('name', user).as_('user'). \n", " bothE('FRIEND'). \n", " has('strength', P.gt(1)).otherV(). \n", " aggregate('friends'). \n", " bothE('FRIEND'). \n", " has('strength', P.gt(1)).otherV(). \n", " where(P.neq('user')).where(P.without('friends')). \n", " groupCount().by('name'). \n", " order(Scope.local).by(Column.values, Order.decr).\n", " next())\n", "\n", "for key in recommendations.keys():\n", " print(key + ': ' + str(recommendations[key]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we discount weak friendships even when traversing to immediate friends, this query can sometimes end up recommending people that have a weak direct tie to our user. But that makes sense in the context of our social domain: one of our close friebnds has a strong friendship with one of the people with whom we have a weak connection; therefore, we might predict that over time this weak bond will grow stronger." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Subgraph Strategies to Filter Edges\n", "\n", "The code above required us to change the initial query in order to apply a predicate to every edge. As an alternative, we can use a [SubgraphStrategy](http://tinkerpop.apache.org/docs/current/reference/#_subgraphstrategy) to supply a filter that will be applied to every edge we traverse. Using the SubgraphStrategy, we can reuse our original query instead of having to rewrite it with `bothE()`.\n", "\n", "First, we create a new traversal source with a SubgraphStrategy applied:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gremlin_python.process.strategies import *\n", "\n", "g2 = g.withStrategies(SubgraphStrategy(edges=__.has('strength', P.gt(1))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we rerun the original query, but using the new traversal source (`g2` rather than `g`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recommendations = (g2.V().hasLabel('User').has('name', user).as_('user'). \n", " both('FRIEND').aggregate('friends'). \n", " both('FRIEND').\n", " where(P.neq('user')).where(P.without('friends')). \n", " groupCount().by('name'). \n", " order(Scope.local).by(Column.values, Order.decr).\n", " next())\n", "\n", "for key in recommendations.keys():\n", " print(key + ': ' + str(recommendations[key]))" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }