{ "cells": [ { "cell_type": "markdown", "id": "eab505f3", "metadata": {}, "source": [ "Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "SPDX-License-Identifier: Apache-2.0\n", "\n", "# Learning openCypher - Variable Length Path Queries\n", "\n", "This notebook is the second in a series of notebooks that walk through how to write queries using openCypher. In this notebook, we will examine the basics of how to perform variable length path queries in openCypher. \n", "\n", "\n", "This notebook builds upon the items covered in the notebook \"01-Basic-Read-Queries\". If you have not loaded the data from that notebook, please follow the steps in the [Getting Started](#Getting-Started) section below. If you have loaded the data, then you can jump ahead to the [Setting up the visualizations](#Setting-up-the-visualizations) section.\n", "\n" ] }, { "cell_type": "markdown", "id": "babcb3a8-77d8-4a58-a303-ced6dfda76be", "metadata": { "tags": [] }, "source": [ "## Getting Started \n", "\n", "For these notebooks, we will be leveraging a dataset from the book [Graph Databases in Action](https://www.manning.com/books/graph-databases-in-action?a_aid=bechberger) from Manning Publications. \n", "\n", "\n", "**Note:** These notebooks do not cover data modeling or building a data loading pipeline. If you would like a more detailed description of how this dataset is constructed and where the design of the data model came from, then please read the book.\n", "\n", "To get started, the first step is to load data into the cluster. 
Assuming the cluster is empty, this can be accomplished by running the cell below which will load our Dining By Friends data.\n", "\n", "### Loading Data" ] }, { "cell_type": "code", "execution_count": null, "id": "2273d5e6-2769-472e-b458-7f738f9831ad", "metadata": {}, "outputs": [], "source": [ "%seed --model Property_Graph --dataset dining_by_friends --run" ] }, { "attachments": { "image-3.png": { "image/png": "" } }, "cell_type": "markdown", "id": "48e9a85f", "metadata": {}, "source": [ "### Looking at our graph data\n", "\n", "Now that we have loaded our data, let's take a moment to look at what our data model looks like:\n", "\n", "\n", "![image-3.png](attachment:image-3.png)\n", "\n", "\n", " \n", " \n", "\n", " \n", "
Element (Node/Edge) Counts
\n", " \n", "|Node Label|Count|\n", "|:--|:--|\n", "|review|109|\n", "|restaurant|40|\n", "|cuisine|24|\n", "|person|8|\n", "|state|2|\n", "|city|2|\n", " \n", "\n", "\n", "|Edge Label|Count|\n", "|:--|:--|\n", "|wrote|218|\n", "|about|218|\n", "|within|84|\n", "|serves|80|\n", "|friends|20|\n", "|lives|16|\n", "\n", "
\n", "\n", "This dataset represents a fictitious, but realistic, restaurant recommendation application that contains:\n", "\n", "* Users, represented by `person` nodes\n", "* Users connected to Users via `friends` edges\n", "* Restaurants and their associated information (`city`, `state`, `cuisine`)\n", "* Reviews, including the body and ratings\n", "* Ratings of reviews (helpful/not helpful)\n", "\n", "The data this application collects has three main aspects. First, it contains a social network consisting of `person` nodes connected to other `person` nodes via a `friends` edge. Second, it contains a restaurant review aspect consisting of `restaurant` nodes, information about those restaurants (`city`/`state`/`cuisine`), and `review` nodes for those restaurants. The third and final aspect is a personalization component where a `person` can rate a `review`, which allows for better recommendations based on a person's preferences.\n", "\n", "Throughout this set of notebooks, we will leverage the different aspects of this data to highlight different fundamental types of common property graph queries, namely neighborhood traversals, hierarchies, paths, and collaborative filtering.\n", "\n", "Now let's get started." ] }, { "cell_type": "markdown", "id": "0c12469c", "metadata": {}, "source": [ "### Setting up the visualizations\n", "\n", "Run the next two cells to configure various display options for our notebook, which we will use later on to display our results in a pleasing visual way. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "6e655017", "metadata": {}, "outputs": [], "source": [ "%%graph_notebook_vis_options\n", "{\n", " \"groups\": { \n", " \"person\": {\n", " \"color\": \"#9ac7bf\"\n", " },\n", " \"review\": {\n", " \"color\": \"#f8cecc\"\n", " },\n", " \"city\": {\n", " \"color\": \"#d5e8d4\"\n", " },\n", " \"state\": {\n", " \"color\": \"#dae8fc\"\n", " },\n", " \"review_rating\": {\n", " \"color\": \"#e1d5e7\"\n", " },\n", " \"restaurant\": {\n", " \"color\": \"#ffe6cc\"\n", " },\n", " \"cusine\": {\n", " \"color\": \"#fff2cc\"\n", " }\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "d5c80800", "metadata": {}, "outputs": [], "source": [ "node_labels = '{\"person\":\"first_name\",\"city\":\"name\",\"state\":\"name\",\"restaurant\":\"name\",\"cusine\":\"name\"}'" ] }, { "cell_type": "markdown", "id": "ab986dac", "metadata": {}, "source": [ "\n", "## Variable Length Paths\n", "\n", "When working with any property graph, some of the most powerful queries you can write are ones where the number of connections between a source and a target entity is not known. These types of queries are so common that property graph query languages, such as openCypher, have first class support as a key piece of the query language. In openCypher, these queries are written using a mechanism known as Variable Length Path patterns or VLPs. VLPs allow us to specify a sequence of nodes and relationships, as well as the number of times to repeat the relationship in the pattern matching syntax. \n", "\n", "In openCypher, a basic VLP query to find all nodes within 1 to 3 hops looks like:\n", "\n", "```\n", "MATCH p=(:person)-[:friends*1..3]->(:person)\n", "RETURN p\n", "```\n", "\n", "Examining this query we see that, while this looks familiar, there are a few new elements to the relationship syntax to highlight, specifically the `*1..3` portion. This portion begins with an asterisk(`*`) indicating that this is a VLP query. 
The next number represents the minimum length of the path, which is followed by two periods (`..`) and the maximum length of the path. \n", "\n", "While this is the basic pattern for VLP queries, there are several different variants, which are shown in the table below: \n", "\n", "#### Variable Length Path Syntax\n", "\n", "| VLP Pattern|Description|\n", "|:--|:--|\n", "|`()-[*2]->()`|Find me a path containing 3 nodes and 2 edges|\n", "|`()-[*2..3]->()`|Find me a path containing a minimum of 3 nodes and 2 edges and a maximum of 4 nodes and 3 edges|\n", "|`()-[*2..]->()`|Find me a path containing a minimum of 3 nodes and 2 edges, with no maximum|\n", "|`()-[*..2]->()`|Find me a path containing a maximum of 3 nodes and 2 edges; the omitted minimum defaults to 1 edge|\n", "|`()-[*]->()`|Find me a path of any length (1 or more edges, with no maximum)|\n", "\n", "\n", "Now that we have a basic understanding of openCypher's VLP syntax, let's look at how this is applied to answer some common graph query patterns.\n", "\n", "### Static Length paths\n", "\n", "The simplest VLP pattern you can write in openCypher is to specify a fixed number of loops/iterations for your pattern. This is accomplished using the syntax `()-[:friends*2]->()`. Let's execute the query below to search for patterns containing 3 nodes and 2 edges." ] }, { "cell_type": "code", "execution_count": null, "id": "deabe58e", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n", "MATCH p=()-[:friends*2]->()\n", "RETURN p \n", "LIMIT 10 " ] }, { "cell_type": "markdown", "id": "4eee92c8", "metadata": {}, "source": [ "Looking at the above query, you'll notice that this looks similar to our Friends of Friends query from the last notebook. In truth, this query can be written using either VLP syntax or as we previously learned (`MATCH p=()-[:friends]->()-[:friends]->()`). 
Why would you want to use VLP syntax here?\n", "\n", "The main reason you might prefer VLP syntax here is that it can make the query significantly more readable. While the queries we are currently looking at are very straightforward, as they get more complex, and the patterns being matched get longer, it can be very helpful to have a shorter, more concise query to make it easier to understand what the query is expected to do. For example, the VLP pattern:\n", "\n", "```\n", "MATCH p=()-[:friends*6]->()\n", "```\n", "is much more understandable than the non-VLP pattern:\n", "\n", "`MATCH p=()-[:friends]->()-[:friends]->()-[:friends]->()-[:friends]->()-[:friends]->()-[:friends]->()`\n", "\n", "even though they are functionally equivalent.\n", "\n", "### A range of lengths\n", "\n", "While the example above works on a static length of paths, sometimes we do not know the number of connections we need to traverse to answer a question. In this case, we can use the minimum and maximum range parameters of our VLP pattern to specify the range of potential connections.\n", "\n", "#### Minimum length\n", "Let's take the query from the last cell and modify it to specify only the minimum length of the pattern instead of a fixed length. We accomplish this by replacing the `*2` with a `*2..`. \n", "\n", "\n", "Execute the query below to see how many paths are connected via a minimum of 2 `friends` edges." ] }, { "cell_type": "code", "execution_count": null, "id": "168bcc76", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n", "MATCH p=()-[:friends*2..]->()\n", "RETURN p \n", "LIMIT 10 " ] }, { "cell_type": "markdown", "id": "6d36e1dd", "metadata": {}, "source": [ "#### Maximum length\n", "Now, let's modify that same query to specify only the maximum length of the pattern, by using `*..2`. \n", "\n", "Execute the query below to see how many paths are connected via a maximum of 2 `friends` edges." 
] }, { "cell_type": "code", "execution_count": null, "id": "55f2aaf9", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n", "MATCH p=()-[:friends*..2]->()\n", "RETURN p \n", "LIMIT 10 " ] }, { "cell_type": "markdown", "id": "6f5ef9ca", "metadata": {}, "source": [ "#### Minimum and Maximum length\n", "Let's combine these last 2 queries together to specify both the minimum and the maximum length of the pattern, by adding `*1..2`. Execute the query below to see how many paths are connected via a minimum of 1 `friends` edge and maximum of 2 `friends` edges.\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "1c3736ac", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n", "MATCH p=()-[:friends*1..2]->()\n", "RETURN p \n", "LIMIT 10 " ] }, { "cell_type": "markdown", "id": "df52d879", "metadata": {}, "source": [ "#### Unbounded Patterns\n", "The final way to use VLP syntax is to not specify any minimum or maximum length and instead go for an unbounded range on the pattern. This is accomplished by adding just the asterisk (`*`). \n", "\n", "\n", "**Important:** While this is a valid query, these sorts of queries tend to have a very high latency as they may traverse/touch a large portion of the graph, depending on how the graph is connected.\n", "\n", "Execute the query below to see how many paths are connected via any number of `friends` edges. " ] }, { "cell_type": "code", "execution_count": null, "id": "651a0f21", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n", "MATCH p=()-[:friends*]->()\n", "RETURN p \n", "LIMIT 10 " ] }, { "cell_type": "markdown", "id": "85de2ce8", "metadata": {}, "source": [ "## Exercises\n", "\n", "Now that we have gone through the basics of VLP queries in openCypher, it's time to put it into practice. Below are several exercises you can complete to verify your understanding of the material covered in this notebook. 
As practice for what you have learned, please write the openCypher queries specified below.\n", "\n", "### Exercise VLP-1 Find the friends of Dave's friends using a VLP\n", "\n", "Using the data model above, write a query that will:\n", "\n", "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", "* Find the friends of Dave (i.e. traverse the `friends` edge)\n", "* Find the friends of that person (i.e. traverse the `friends` edge)\n", "* Return each friend's `first_name`\n", "\n", "The correct answer is three results: \"Hank\", \"Denise\", \"Paras\"" ] }, { "cell_type": "code", "execution_count": null, "id": "3d1cc0f0", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n" ] }, { "cell_type": "markdown", "id": "6521e66f", "metadata": {}, "source": [ "### Exercise VLP-2 Find all `person` nodes connected to Dave\n", "\n", "Starting at a single node and trying to find all connected children (a.k.a. root to leaf) or trying to find the parent of any child node (a.k.a. leaf to root) are two very common hierarchical graph query patterns. These queries commonly support bill of materials, information organization, or compliance use cases.\n", "\n", "In this exercise, we will be applying that same query pattern to find the hierarchy of people within our social network. 
We'll accomplish this by writing a \"root to leaf\" type query where the root node is our `Dave` node in the social network.\n", "\n", "Using the data model above, write a query that will:\n", "\n", "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", "* Keep traversing the outgoing `friends` edge until there are no more outgoing `friends` edges\n", "* Return all the paths\n", "\n", "The correct answer is nine results" ] }, { "cell_type": "code", "execution_count": null, "id": "10b4aa1f", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n" ] }, { "cell_type": "markdown", "id": "bda5cbf3", "metadata": {}, "source": [ "### Exercise VLP-3 Find if Dave and Denise are connected\n", "\n", "Attempting to see if, and how, two entities in a graph are connected is a common path type query pattern. With an OLTP graph database, queries containing unbounded path traversals are best suited to point-to-point or point-to-set path questions. Set-to-set or all-pairs path questions are best answered with graph algorithms.\n", "\n", "In this exercise, we will be applying a path type query pattern to find out if two people are connected.\n", "\n", "Using the data model above, write a query that will:\n", "\n", "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", "* Find the friends of Dave (i.e. 
traverse the `friends` edge)\n", "* Keep traversing the `friends` edge until you find `Denise`\n", "* Return a single `True` as the result\n", "\n", "The correct answer is a single result: `True`" ] }, { "cell_type": "code", "execution_count": null, "id": "b44692f8", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n" ] }, { "cell_type": "markdown", "id": "0ce0b6c8", "metadata": {}, "source": [ "### Exercise VLP-4 Find all the ways Dave and Denise are connected\n", "\n", "A common extension to the path traversal query we wrote in VLP-3 is to return not just \"if\" two people are connected but \"how\" they are connected.\n", "\n", "In this exercise, we will be making a slight modification to the previous query to return \"how\" Dave and Denise are connected, not just that they are.\n", "\n", "Using the data model above, write a query that will:\n", "\n", "* Find a `person` node(s) with a `first_name` of \"Dave\"\n", "* Find the friends of Dave (i.e. traverse the `friends` edge)\n", "* Keep traversing the `friends` edge until you find `Denise`\n", "* Return the path\n", "\n", "The correct answer has fifteen results" ] }, { "cell_type": "code", "execution_count": null, "id": "007a9efd", "metadata": {}, "outputs": [], "source": [ "%%oc -d $node_labels\n" ] }, { "cell_type": "markdown", "id": "b5acefc5", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this notebook, we explored writing variable length path queries in openCypher. These queries are a powerful and common way to explore connected data to answer questions, especially those where the exact number of connections is unknown. \n", "\n", "In the next notebook, we will take what we have learned in this notebook and extend it to demonstrate how to order, group, and aggregate values in queries." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }