{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Air Routes\n", "\n", "The examples in this notebook demonstrate using the GremlinPython library to connect to and work with a Neptune instance. A Jupyter notebook is a convenient way to interact with your Neptune graph database in a familiar and instantly productive environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the Air Routes dataset\n", "\n", "When the SageMaker notebook instance was created, the appropriate Python libraries for working with a TinkerPop-enabled graph were installed. We now need to `import` some classes from those libraries before connecting to our Neptune instance, loading some sample data, and running queries.\n", "\n", "The `neptune.py` helper module that was installed in the _util_ directory does all the necessary heavy lifting with regard to importing classes and loading the air routes dataset. You can reuse this module in your own notebooks, or consult its source code to see how to configure GremlinPython." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run '../util/neptune.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the neptune module, we can clear any existing data from the database and load the air routes graph:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "neptune.clear()\n", "neptune.bulkLoad('s3://aws-neptune-customer-samples-${AWS_REGION}/neptune-sagemaker/data/let-me-graph-that-for-you/01-air-routes/', interval=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Establish access to our Neptune instance\n", "\n", "Before we can work with our graph we need to establish a connection to it. This is done using the `DriverRemoteConnection` capability as defined by Apache TinkerPop and supported by GremlinPython. 
The `neptune.py` helper module facilitates creating this connection.\n", "\n", "Once this cell has been run we will be able to use the variable `g` to refer to our graph in Gremlin queries in subsequent cells. By default Neptune uses port 8182 and that is what we connect to below. When you configure your own Neptune instance you can choose a different endpoint and port number by specifying the `neptune_endpoint` and `neptune_port` parameters to the `graphTraversal()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = neptune.graphTraversal()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's find out a bit about the graph\n", "\n", "Let's start off with a simple query just to make sure our connection to Neptune is working. The queries below look at all of the vertices and edges in the graph and create two maps that show how many vertices and edges there are for each label. As we are using the air routes data set, not surprisingly, the values returned are related to airports and routes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vertices = g.V().groupCount().by(T.label).toList()\n", "edges = g.E().groupCount().by(T.label).toList()\n", "print(vertices)\n", "print(edges)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find routes longer than 8,400 miles\n", "\n", "The query below finds routes in the graph that are longer than 8,400 miles. This is done by examining the `dist` property of the `route` edges in the graph. Having found some edges that meet our criteria we sort them in descending order by distance. The `where` step filters out the reverse direction of each route that we have already found because we do not, in this case, want two results for each route. As an experiment, try removing the `where` line and observe the additional results that are returned. 
Lastly we generate some `path` results using the airport codes and route distances. Notice how we have laid the Gremlin query out over multiple lines to make it easier to read. To avoid errors, when you lay out a query in this way using Python, each line must end with a backslash character \"\\\".\n", "\n", "The results from running the query will be placed into the variable `paths`. Notice how we ended the Gremlin query with a call to `toList`. This tells Gremlin that we want our results back in a list. We can then use a Python `for` loop to print those results. Each entry in the list will itself be a list containing the starting airport code, the length of the route and the destination airport code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "paths = g.V().hasLabel('airport').as_('a') \\\n", " .outE('route').has('dist',gt(8400)) \\\n", " .order().by('dist',Order.decr) \\\n", " .inV() \\\n", " .where(P.lt('a')).by('code') \\\n", " .path().by('code').by('dist').by('code').toList()\n", "\n", "for p in paths:\n", " print(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Draw a Bar Chart that represents the routes we just found.\n", "\n", "One of the nice things about using Python to work with our graph is that we can take advantage of the larger Python ecosystem of libraries such as `matplotlib`, `numpy` and `pandas` to further analyze our data and represent it pictorially. So, now that we have found some long airline routes we can build a bar chart that represents them graphically." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt; plt.rcdefaults()\n", "import numpy as np\n", "import pandas as pd\n", "\n", "routes = list()\n", "dist = list()\n", "\n", "# Construct the x-axis labels by combining the airport pairs we found\n", "# into strings with a \"-\" between them. We also build a list containing\n", "# the distance values that will be used to construct and label the bars.\n", "for i in range(len(paths)):\n", "    routes.append(paths[i][0] + '-' + paths[i][2])\n", "    dist.append(paths[i][1])\n", "\n", "# Set up everything we need to draw the chart\n", "y_pos = np.arange(len(routes))\n", "y_labels = (0,1000,2000,3000,4000,5000,6000,7000,8000,9000)\n", "freq_series = pd.Series(dist)\n", "plt.figure(figsize=(11,6))\n", "fs = freq_series.plot(kind='bar')\n", "fs.set_ylabel('Miles')\n", "fs.set_title('Longest routes')\n", "\n", "# Set the tick positions before the labels so that they line up\n", "fs.set_xticks(y_pos)\n", "fs.set_xticklabels(routes)\n", "fs.yaxis.set_ticks(np.arange(0, 10000, 1000))\n", "fs.yaxis.set_ticklabels(y_labels)\n", "\n", "# Annotate each bar with the distance value\n", "for i in range(len(paths)):\n", "    fs.annotate(dist[i],xy=(i,dist[i]+60),xycoords='data',ha='center')\n", "\n", "# We are finally ready to draw the bar chart\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore the distribution of airports by continent\n", "\n", "The next example queries the graph to find out how many airports are in each continent. The query starts by finding all vertices that are continents. Next, those vertices are grouped, which creates a map (or dict) whose keys are the continent descriptions and whose values are the counts of the outgoing edges with a 'contains' label. The resulting map is then sorted by key in ascending order. 
That result is then returned to our Python code as the variable `m`. Finally we can print the map using an ordinary Python `for` loop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Return a map where the keys are the continent names and the values are the\n", "# number of airports in that continent.\n", "m = g.V().hasLabel('continent') \\\n", "    .group().by('desc').by(__.out('contains').count()) \\\n", "    .order(Scope.local).by(Column.keys) \\\n", "    .next()\n", "\n", "for c,n in m.items():\n", "    print('%4d %s' %(n,c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Draw a pie chart representing the distribution by continent\n", "\n", "Rather than return the results as text like we did above, it might be nicer to display them as percentages on a pie chart. That is what the code in the next cell does. Instead of returning the descriptions of the continents (their names), this time our Gremlin query simply retrieves the two-character code representing each continent." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt; plt.rcdefaults()\n", "import numpy as np\n", "\n", "# Return a map where the keys are the continent codes and the values are the\n", "# number of airports in that continent.\n", "m = g.V().hasLabel('continent').group().by('code').by(__.out().count()).next()\n", "\n", "fig,pie1 = plt.subplots()\n", "\n", "# Convert the dict views to lists before handing them to matplotlib.\n", "# The explode tuple pulls the third slice out slightly and assumes that\n", "# the graph contains seven continents.\n", "pie1.pie(list(m.values()),\n", "         labels=list(m.keys()),\n", "         autopct='%1.1f%%',\n", "         shadow=True,\n", "         startangle=90,\n", "         explode=(0,0,0.1,0,0,0,0))\n", "\n", "pie1.axis('equal')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Find some routes from London to San Jose and draw them\n", "\n", "One of the nice things about connected graph data is that it lends itself to visualizations that people find useful. 
The Python `networkx` library makes it fairly easy to draw a graph. The next example takes advantage of this capability to draw a directed graph (DiGraph) of a few airline routes.\n", "\n", "The query below starts by finding the vertex that represents London Heathrow (LHR). It then finds up to 15 routes from LHR that end up in San Jose, California (SJC) with one stop on the way. Those routes are returned as a list of paths. Each path will contain the three-character IATA codes of the airports found.\n", "\n", "The main purpose of this example is to show that we can easily extract part of a larger graph and render it graphically in a way that is easy for an end user to comprehend." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt; plt.rcdefaults()\n", "import numpy as np\n", "import pandas as pd\n", "import networkx as nx\n", "\n", "# Find up to 15 routes from LHR to SJC that make one stop.\n", "paths = g.V().has('airport','code','LHR') \\\n", "    .out().out().has('code','SJC').limit(15) \\\n", "    .path().by('code').toList()\n", "\n", "# Create a new empty DiGraph\n", "G = nx.DiGraph()\n", "\n", "# Add the routes we found to the DiGraph we just created\n", "for p in paths:\n", "    G.add_edge(p[0],p[1])\n", "    G.add_edge(p[1],p[2])\n", "\n", "# Give the starting and ending airports a different color\n", "colors = []\n", "\n", "for label in G:\n", "    if label in ['LHR','SJC']:\n", "        colors.append('yellow')\n", "    else:\n", "        colors.append('#11cc77')\n", "\n", "# Now draw the graph\n", "plt.figure(figsize=(5,5))\n", "nx.draw(G, node_color=colors, node_size=1200, with_labels=True)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PART 2 - Examples that use iPython Gremlin\n", "\n", "This part of the notebook contains examples that use the iPython Gremlin Jupyter extension to work with a Neptune instance using Gremlin."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configuring iPython Gremlin to work with Neptune\n", "\n", "Before we can start to use iPython Gremlin we need to load the Jupyter Kernel extension and configure access to our Neptune endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Create a string containing the full Web Socket path to the endpoint.\n", "# To use a different instance, set neptune_endpoint below to its name,\n", "# which will be of the form myinstance.us-east-1.neptune.amazonaws.com\n", "\n", "#neptune_endpoint = ''\n", "import os\n", "neptune_endpoint = os.environ['NEPTUNE_CLUSTER_ENDPOINT']\n", "neptune_port = os.environ['NEPTUNE_CLUSTER_PORT']\n", "neptune_gremlin_endpoint = 'wss://' + neptune_endpoint + ':' + neptune_port + '/gremlin'\n", "\n", "# Load the iPython Gremlin extension and set up access to Neptune.\n", "%load_ext gremlin\n", "%gremlin.connection.set_current $neptune_gremlin_endpoint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run this cell if you need to reload the Gremlin extension.\n", "Occasionally it becomes necessary to reload the iPython Gremlin extension to make things work. Running this cell will do that for you." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Re-load the iPython Gremlin Jupyter Kernel extension.\n", "%reload_ext gremlin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A simple query to make sure we can connect to the graph.\n", "\n", "Find all the airports in England that are in London. Notice that when using iPython Gremlin you do not need to use a terminal step such as `next` or `toList` at the end of the query in order to get it to return results. The `%reset -f` line works around a known issue with iPython Gremlin."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reset -f\n", "%gremlin g.V().has('airport','region','GB-ENG') \\\n", " .has('city','London').values('desc')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### You can store the results of a query in a variable just as when using Gremlin Python.\n", "The query below is the same as the previous one except that the results of running the query are stored in the variable 'places'. We can then work with that variable in our code." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%reset -f\n", "places = %gremlin g.V().has('airport','region','GB-ENG') \\\n", " .has('city','London').values('desc')\n", "for p in places:\n", " print(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Treating entire cells as Gremlin\n", "Any cell that begins with `%%gremlin` tells iPython Gremlin to treat the entire cell as Gremlin. You cannot mix Python code into these cells." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%gremlin\n", "g.V().has('city','London').has('region','GB-ENG').count()\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }