Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: Apache-2.0

# Building a Fraud Graph Application on Amazon Neptune

This notebook shows how Amazon Neptune can be used to build a fraud graph for a fraud detection solution. It includes a credit card transaction data model that connects accounts, merchants and transactions, and Gremlin queries that help identify fraud rings (first-party fraud) and instances of identity theft (third-party fraud).

  - [Background](#Background)
  - [Getting Started](#Getting-Started)
  - [Fraud Rings](#Fraud-Rings)
  - [Identity Theft](#Identity-Theft)
  - [Building a Fraud Detection Solution](#Building-a-Fraud-Detection-Solution)
  - [Conclusion](#Conclusion)
  - [What's Next?](#What's-Next?)

## Background

Fraud hides itself in isolation. It exploits our failure to assess a transaction in the context of other recent transactions: patterns of fraudulent behaviour emerge only when we connect many seemingly discrete data points and events.

A fraud graph connects the entities participating in retail and financial transactions: entities such as accounts, transactions and merchants. By relating accounts and connecting transactions we improve our chances of detecting and preventing fraud. With Amazon Neptune we can connect account and transaction information and query it whenever a new account or transaction is submitted to the system. Using queries that find patterns in our data that we know to be _indicative_ of fraud, we can evaluate each transaction in the context of other transactions and accounts, and thereby determine whether constellations of data in the fraud graph represent fraudulent activity.

The examples in this use case show how we can identify fraud rings (first-party fraud) and instances of identity theft (third-party fraud) in a credit card dataset.

  - **Fraud ring**: A group of people who give false information when applying for a credit card, with the intention of purchasing goods and services without repaying the debt.
  - **Identity theft**: When an account holder's personal details are used without their permission to purchase goods or services.
  
Most of the queries in this notebook are _graph local_ queries. A graph local query takes as its starting point an individual entity - an account, or transaction, for example - and from there explores the neighboring parts of the graph in order to compute a result or discover a local constellation of connected data. Fraud detection solutions use graph local queries when accounts are created or modified, or new transactions are submitted to the system, to identify fraudulent behaviour associated with the account or transaction.

## Getting Started

In this section we'll load the fraud graph and set some visualization options. We'll then use some Gremlin queries to inspect the data model used throughout the solution.

### Load data

The cell below loads the example fraud graph into your Neptune cluster. When you run the cell you will be prompted to select a `Source type`, a `Data Model`, and a `Data set`. Select `samples`, `Property_Graph`, and `fraud_graph`, respectively. The graph takes about 5 minutes to load.

In [None]:
%seed --model Property_Graph --language gremlin --dataset fraud_graph --run

### Set visualization options

The command below configures the visualization to use specific colours and icons for the different parts of the data model.

In [None]:
%%graph_notebook_vis_options

{
  "groups": {
    "Account": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf2bb",
        "color": "red"
      }
    },
    "Transaction": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf155",
        "color": "green"
      }
    },
    "Merchant": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf290",
        "color": "orange"
      }
    },
    "DateOfBirth": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf1fd",
        "color": "blue"
      }
    },
    "EmailAddress": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf1fa",
        "color": "blue"
      }
    },
    "Address": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf015",
        "color": "blue"
      }
    },
    "IpAddress": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf109",
        "color": "blue"
      }
    },
    "PhoneNumber": {
      "shape": "icon",
      "icon": {
        "face": "FontAwesome",
        "code": "\uf095",
        "color": "blue"
      }
    }
  },
  "edges": {
    "color": {
      "inherit": false
    },
    "smooth": {
      "enabled": true,
      "type": "straightCross"
    },
    "arrows": {
      "to": {
        "enabled": false,
        "type": "arrow"
      }
    },
    "font": {
      "face": "courier new"
    }
  },
  "interaction": {
    "hover": true,
    "hoverConnectedEdges": true,
    "selectConnectedEdges": false
  },
  "physics": {
    "minVelocity": 0.75,
    "barnesHut": {
      "centralGravity": 0.1,
      "gravitationalConstant": -50450,
      "springLength": 95,
      "springConstant": 0.04,
      "damping": 0.09,
      "avoidOverlap": 0.1
    },
    "solver": "barnesHut",
    "enabled": true,
    "adaptiveTimestep": true,
    "stabilization": {
      "enabled": true,
      "iterations": 1
    }
  }
}

### Data model

The fraud graph included in this example models credit card accounts, account holder information, merchants, and the transactions performed when an account holder purchases goods or services from a merchant.

#### Account and features

An `Account` has a number of features, including physical `Address`, `IpAddress`, `DateOfBirth` of the account holder, `EmailAddress`, and contact `PhoneNumber`. An account holder can have multiple email addresses and phone numbers.

In many graph data models these features of the account holder would be modelled as properties of the account. But with fraud detection it's important to be able to link accounts based on shared features, and to find related accounts at query time based on one or more shared features. Hence, our fraud detection application graph data model stores each feature as a separate vertex. Multiple accounts that share the same feature value - the same physical address, for example - are connected to the single vertex representing that feature value. For more details on modelling shared features as vertices, see [Relating entities through their attributes at query time](https://github.com/aws-samples/aws-dbs-refarch-graph/tree/master/src/graph-data-modelling#relating-entities-through-their-attributes-at-query-time).
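
The idea of storing each feature value as a single shared vertex can be sketched outside the database. The following Python snippet is a minimal illustration only: the account ids, feature values and the `related_accounts` helper are hypothetical and are not part of the sample dataset or of any Neptune API.

```python
from collections import defaultdict

# Hypothetical sample data: (account id, feature label, feature value).
account_features = [
    ("account-1", "Address", "1 Main St"),
    ("account-1", "PhoneNumber", "555-0100"),
    ("account-2", "Address", "1 Main St"),
    ("account-2", "EmailAddress", "b@example.com"),
    ("account-3", "PhoneNumber", "555-0199"),
]

# Each distinct (label, value) pair acts as a single feature vertex.
# Every account with that value is connected to the same vertex.
accounts_by_feature = defaultdict(set)
for account, label, value in account_features:
    accounts_by_feature[(label, value)].add(account)


def related_accounts(account):
    """Return {other account: set of shared feature vertices}."""
    related = defaultdict(set)
    for feature, accounts in accounts_by_feature.items():
        if account in accounts:
            for other in accounts - {account}:
                related[other].add(feature)
    return dict(related)


print(related_accounts("account-1"))
```

In this sketch `account-1` is related to `account-2` only through the shared address vertex, while `account-3` is related to no other account. Had the address been stored as a property on each account, finding that link would require comparing property values across all accounts rather than following edges from one vertex.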

The following query shows a single account and its associated features. After running the query, click the `Graph` tab to see a visualization of the results. Note that the account holder shown in the results has two email addresses and two phone numbers.

In [None]:
%%gremlin -g type -p v,inV,outV

g.V('account-4398046521820').
  in('FEATURE_OF_ACCOUNT').
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'first_name', 'last_name', 'value'))
  )

The next query shows two accounts that share the same feature - a date of birth. After running the query, click the `Graph` tab to see a visualization of the results.

In [None]:
%%gremlin -g type -p v,inV,outV

g.V('account-8853').
  in('FEATURE_OF_ACCOUNT').
  out('FEATURE_OF_ACCOUNT').
  simplePath().
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )

#### Transaction

When an `Account` holder purchases goods or services from a `Merchant` using their credit card, we create a new `Transaction` vertex in the graph with `amount` (numeric) and `created` (timestamp) properties.

A transaction is associated with an account and a merchant. Each transaction is also associated with one or more features, such as `IpAddress` or `PhoneNumber`, captured at the time the transaction was submitted to the system. As with features associated with account holders, we store each feature value as a separate vertex.

The following query shows a single transaction between an account and a merchant, together with the IP address from which the transaction was submitted. After running the query, click the `Graph` tab to see a visualization of the results.

In [None]:
%%gremlin -g type -p v,outV,inV

g.V('account-8698').
  in('ACCOUNT').limit(1).
  union(
      out('MERCHANT'),
      in('FEATURE_OF_TRANSACTION')
  ).
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'created', 'amount', 'name', 'value'))
  )

#### All transactions for an account

The following query shows all the transactions for an account. Some things to notice here:

  - There are several examples of the account making multiple payments to the same merchant. 
  - Most transactions are submitted from the IP address associated with the account. 
  - A few transactions use a different IP address. 
  - One of the transactions was submitted over the phone.

In [None]:
%%gremlin -g type -p v,outV,inV

g.V('account-8698').
  in('ACCOUNT').
  union(
      out('MERCHANT'),
      in('FEATURE_OF_TRANSACTION')
  ).
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'created', 'amount', 'name', 'value'))
  )

#### Shared households

Account holders who live at the same address will often have a number of shared features, including physical address, IP address and phone number. 

The following query shows accounts that share several of these features. Notice how accounts that belong to the same household form small, discrete components.

In [None]:
%%gremlin -g type -p v,outV,inV

g.V().hasLabel('IpAddress').
  where(
      out('FEATURE_OF_ACCOUNT').count().is(eq(2))).
  where(
      out('FEATURE_OF_ACCOUNT').
      in('FEATURE_OF_ACCOUNT').hasLabel('Address').dedup().count().is(eq(1)) 
  ).
  limit(5).
  out('FEATURE_OF_ACCOUNT').
  in('FEATURE_OF_ACCOUNT').
  simplePath().
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )

## Fraud Rings

Fraud rings are examples of first-party fraud. A fraud ring consists of a group of people who give false information when applying for a service, such as a credit card, in order to abuse the line of credit made available to them. The members of a fraud ring may act as "good citizens" for a while, but at some point they will coordinate their actions to leverage the credit, with no intention of paying off the debt. 

Accounts belonging to a fraud ring often willingly or inadvertently share features such as IP addresses or physical addresses. These shared features can be used to identify activities that can indicate the presence of a fraud ring.

### Account with many shared features (possible fraud ring)

The following query shows an account that is linked to a number of other accounts by way of some shared features. Note that there are more accounts here than are typically encountered in a shared household.

In [None]:
%%gremlin -g type -p v,inV,outV

g.V('account-4398046519460').
  in('FEATURE_OF_ACCOUNT').
  out('FEATURE_OF_ACCOUNT').
  simplePath().
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )

### Extended fraud ring

We can extend the scope of the previous query to find linked accounts two hops from the starting account. The size and complexity of this account network is suggestive of a fraud ring:

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV

g.V('account-4398046519460').
  emit().
  repeat(
      in('FEATURE_OF_ACCOUNT').
      out('FEATURE_OF_ACCOUNT').
      simplePath()
  ).times(2). 
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value'))
  )
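
To make the traversal logic easier to follow, here is a rough Python sketch of what the `emit().repeat(...).times(2)` pattern with `simplePath()` computes. It runs against a tiny in-memory graph with hypothetical vertex ids; the `expand` function is our own illustration, not part of Gremlin or Neptune.

```python
# Hypothetical feature -> accounts edges. In the fraud graph,
# FEATURE_OF_ACCOUNT edges point from a feature to an account.
feature_to_accounts = {
    "ip-1": ["acct-A", "acct-B"],
    "addr-1": ["acct-B", "acct-C"],
    "dob-1": ["acct-C", "acct-D"],
}

# Reverse index: account -> features, used for the in() step.
account_to_features = {}
for feature, accounts in feature_to_accounts.items():
    for account in accounts:
        account_to_features.setdefault(account, []).append(feature)


def expand(start, hops=2):
    """Return simple paths account -> feature -> account, up to `hops` hops."""
    frontier = [[start]]
    paths = [[start]]  # emit() before repeat() also emits the start vertex
    for _ in range(hops):
        next_frontier = []
        for path in frontier:
            for feature in account_to_features.get(path[-1], []):
                if feature in path:
                    continue  # simplePath(): no vertex may repeat in a path
                for account in feature_to_accounts[feature]:
                    if account in path:
                        continue
                    next_frontier.append(path + [feature, account])
        paths.extend(next_frontier)  # emit the paths found at this depth
        frontier = next_frontier
    return paths


for path in expand("acct-A"):
    print(" -> ".join(path))
```

With two hops the sketch reaches `acct-B` and `acct-C` from `acct-A`, but not `acct-D`, which sits three hops away. Raising `hops` widens the net, at the cost of pulling in more loosely related accounts.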

### Fraud ring and transactions

By further modifying the query, we can find all the transactions linked to the accounts in this presumed fraud ring.

Run the following query and then click the `Graph` tab to see a visualization of the results. If you grab the accounts and tease them apart you'll see that there is one account (0010-9951-1628-1609) and its transactions that are linked to a member of the supposed fraud ring only by way of a shared date of birth: this is likely a legitimate account.

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV,inV,outV

g.V('account-4398046519460').
  emit().
  repeat(
      in('FEATURE_OF_ACCOUNT').
      out('FEATURE_OF_ACCOUNT').
      simplePath()
  ).times(2). 
  in('ACCOUNT').
  out('MERCHANT').
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )

Here's a similar query that starts from a specific transaction rather than an account. The results are similar: they show the constellation of accounts and other transactions associated with this transaction.

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV,inV,outV

g.V('05d69fc8a55e4b648318d460e748839a').
  out('ACCOUNT').as('account').
  emit().
  repeat(
      in('FEATURE_OF_ACCOUNT').
      out('FEATURE_OF_ACCOUNT').
      simplePath()
  ).times(2). 
  in('ACCOUNT').
  out('MERCHANT').
  path().from('account').
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )

### Find all potential fraud rings

The following query finds all accounts that appear to share features with several other accounts. Unlike most of the account- or transaction-centric graph local interactive queries in this notebook, this is more of an analytics query, which surveys all accounts in the database. The `where()` step helps identify accounts that share features with other accounts.

The results here show how difficult it is to identify the boundaries of a fraud ring. Real-world data is tangled and messy: something as simple as a shared date of birth may inadvertently associate an innocent account holder with a fraud ring. Nonetheless we can easily spot clusters of accounts that warrant further investigation.

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV

g.V().hasLabel('Account').
  where(
      in('FEATURE_OF_ACCOUNT').
      out('FEATURE_OF_ACCOUNT').
      dedup().
      count().
      is(gt(5))
  ).
  emit().
  repeat(
      in('FEATURE_OF_ACCOUNT').
      out('FEATURE_OF_ACCOUNT').
      simplePath()
  ).
  times(2). 
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value'))
  )
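
The effect of the threshold in the `where()` step can be illustrated with a small Python sketch. The data and the `linked_account_count` and `candidates` helpers below are hypothetical. Note one assumption about the query as written: because the filter has no `simplePath()`, the count appears to include the starting account itself, and the sketch mirrors that.

```python
# Hypothetical feature -> accounts edges.
feature_to_accounts = {
    "ip-9": {"acct-1", "acct-2", "acct-3", "acct-4"},
    "addr-1": {"acct-5", "acct-6"},
    "dob-1": {"acct-4", "acct-5"},
}


def linked_account_count(account):
    """Count distinct accounts reachable through any feature of `account`.

    Mirrors in().out().dedup().count(); the starting account is included.
    """
    linked = set()
    for accounts in feature_to_accounts.values():
        if account in accounts:
            linked.update(accounts)
    return len(linked)


def candidates(threshold):
    """Accounts whose linked-account count is greater than `threshold`."""
    all_accounts = set().union(*feature_to_accounts.values())
    return sorted(a for a in all_accounts if linked_account_count(a) > threshold)


print(candidates(3))
print(candidates(4))
```

In this sketch a threshold of 3 selects the four accounts that share `ip-9`, while a threshold of 4 leaves only `acct-4`, which is additionally linked to `acct-5` through a shared date of birth. This is the same kind of incidental link that makes the boundaries of a fraud ring hard to draw.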

Filtering by IP address, we find several accounts that might serve as starting points for more detailed fraud ring investigations:

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV,inV,outV

g.V().hasLabel('Account').
  where(
      in('FEATURE_OF_ACCOUNT').
      out('FEATURE_OF_ACCOUNT').
      dedup().
      count().
      is(gt(5))
  ).
  emit().
  repeat(
      in('FEATURE_OF_ACCOUNT').hasLabel('IpAddress').
      out('FEATURE_OF_ACCOUNT').
      simplePath()
  ).
  times(2).dedup().
  in('ACCOUNT').in('FEATURE_OF_TRANSACTION').hasLabel('IpAddress').
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created'))
  )

## Identity Theft

Identity theft is an example of third-party fraud. With identity theft, an individual or group of people steals the account and personal details of one or more account holders, and uses them to purchase goods and services.

Transactions committed via identity theft are typically concentrated in a short window of time. The transactions are often associated with phone numbers or IP addresses not normally seen in the account holder's transaction history, and may reflect purchases and purchase amounts quite uncharacteristic of the account holder.

### Single victim, multiple unusual transaction features

The following query shows multiple transactions for a single account over a 2-day period. The transactions originate from several IP addresses and phone numbers not associated with the account, so they may be fraudulent. Note that there is one transaction, for 971 dollars, that was issued from an IP address associated with the account: this is likely a legitimate transaction.

In [None]:
%%gremlin -g type

g.V('account-17592186055331').
  union(
      in('ACCOUNT').
      and(
          has('created', gte(datetime('2021-01-10'))), 
          has('created', lte(datetime('2021-01-12')))
      ).
      in('FEATURE_OF_TRANSACTION'),
      in('FEATURE_OF_ACCOUNT')
  ).
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  ) 
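
The comparison that a reviewer makes when reading these results (which transactions in the window used a feature that is not one of the account's own?) can be sketched in Python. The transaction ids, amounts and feature names below are made up for illustration and do not come from the sample dataset.

```python
from datetime import datetime

# Hypothetical features attached to the account itself.
account_features = {"ip-home", "phone-home"}

# Hypothetical transactions: (id, created, amount, feature used).
transactions = [
    ("t1", datetime(2021, 1, 9, 12, 0), 40.0, "ip-home"),
    ("t2", datetime(2021, 1, 10, 9, 30), 120.0, "ip-home"),
    ("t3", datetime(2021, 1, 10, 22, 15), 1500.0, "ip-unknown-1"),
    ("t4", datetime(2021, 1, 11, 3, 5), 2200.0, "phone-unknown"),
    ("t5", datetime(2021, 1, 13, 8, 0), 25.0, "ip-unknown-2"),
]


def unfamiliar_transactions(start, end):
    """Ids of transactions in [start, end] that used a non-account feature."""
    return [
        tx_id
        for tx_id, created, _amount, feature in transactions
        if start <= created <= end and feature not in account_features
    ]


flagged = unfamiliar_transactions(datetime(2021, 1, 10), datetime(2021, 1, 12))
print(flagged)
```

Here `t3` and `t4` are flagged. `t2` falls inside the window but uses the account's own IP address, and `t5` uses an unfamiliar IP address but falls outside the window.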

Here's a similar query that starts from a specific transaction rather than an account. The results are similar: they show the constellation of other transactions and features associated with the starting transaction.

In [None]:
%%gremlin -g type

g.V('a5fdd9f7b64c48d2823da969d885373d').
  out('ACCOUNT').as('account').
  union(
      in('ACCOUNT').
      and(
          has('created', gte(datetime('2021-01-10'))), 
          has('created', lte(datetime('2021-01-12')))
      ).
      in('FEATURE_OF_TRANSACTION'),
      in('FEATURE_OF_ACCOUNT')
  ).
  path().from('account').
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  ) 

### Multiple victims, same transaction features

Another pattern of identity theft is characterised by multiple transactions being issued in a short space of time for many different accounts - all from the same IP address.

The following query shows an example of identity theft comprising transactions submitted from the same IP address that target multiple victims. 

The query starts from a particular account (0028-5873-0233-1601). We see several transactions issued from an IP address associated with this starting account: these are likely legitimate transactions. But the results also show one transaction for this account and several transactions for many other accounts, all issued in a short space of time from an IP address not associated with the starting account. These are likely fraudulent transactions committed via identity theft.

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV

g.V('account-28587302331601').
  in('ACCOUNT').
  and(
      has('created', gte(datetime('2021-01-03'))), 
      has('created', lte(datetime('2021-01-07')))
  ).
  in('FEATURE_OF_TRANSACTION').
  out('FEATURE_OF_TRANSACTION').
  and(
      has('created', gte(datetime('2021-01-03'))), 
      has('created', lte(datetime('2021-01-07')))
  ).
  union(
      out('ACCOUNT'),
      out('MERCHANT')
  ).
  path().
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )
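
As a rough illustration of the pattern this query surfaces, the following Python sketch groups transactions in a time window by IP address and reports the addresses used by several different accounts. The data, the `min_accounts` threshold and both helper functions are assumptions made for this example; the Gremlin query itself applies no such threshold.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transactions: (id, account, ip address, created).
transactions = [
    ("t1", "acct-1", "ip-home", datetime(2021, 1, 3, 10, 0)),
    ("t2", "acct-1", "ip-home", datetime(2021, 1, 4, 11, 0)),
    ("t3", "acct-1", "ip-x", datetime(2021, 1, 5, 1, 0)),
    ("t4", "acct-2", "ip-x", datetime(2021, 1, 5, 1, 5)),
    ("t5", "acct-3", "ip-x", datetime(2021, 1, 5, 1, 9)),
    ("t6", "acct-4", "ip-x", datetime(2021, 1, 20, 1, 0)),
]


def accounts_per_ip(start, end):
    """Map each IP address to the accounts that transacted from it in the window."""
    accounts = defaultdict(set)
    for _tx_id, account, ip, created in transactions:
        if start <= created <= end:
            accounts[ip].add(account)
    return accounts


def multi_victim_ips(start, end, min_accounts=3):
    """IP addresses used by at least `min_accounts` distinct accounts."""
    grouped = accounts_per_ip(start, end)
    return sorted(ip for ip, accts in grouped.items() if len(accts) >= min_accounts)


window = (datetime(2021, 1, 3), datetime(2021, 1, 7))
print(multi_victim_ips(*window))
```

In this sketch `ip-x` is reported because three different accounts transacted from it within the window, whereas `ip-home` is used by a single account only.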

Here's a similar query that starts from a specific transaction rather than an account. The results are similar: they show the constellation of other transactions and features associated with the starting transaction.

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV

g.V('7dbf0290893441a699a30362d4158d4c').
  out('ACCOUNT').as('account').
  in('ACCOUNT').
  and(
      has('created', gte(datetime('2021-01-03'))), 
      has('created', lte(datetime('2021-01-07')))
  ).
  in('FEATURE_OF_TRANSACTION').
  out('FEATURE_OF_TRANSACTION').
  and(
      has('created', gte(datetime('2021-01-03'))), 
      has('created', lte(datetime('2021-01-07')))
  ).
  union(
      out('ACCOUNT'),
      out('MERCHANT')
  ).
  path().from('account').
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )

### Find all potential identity theft

The following query finds all transactions that were made using a feature, such as an IP address, that is not associated with any account but was used in at least 4 transactions. Like the query that finds all potential fraud rings, this is more of an analytics query, which surveys all transactions in the database. The `where()` step helps tune the sensitivity of the query.

In [None]:
%%gremlin -g type -p v,inV,outV,inV,outV

g.E().hasLabel('FEATURE_OF_TRANSACTION').
  outV().as('feature').
  where(
      and(
          out('FEATURE_OF_ACCOUNT').count().is(eq(0)),
          out('FEATURE_OF_TRANSACTION').count().is(gte(4))
      ) 
  ).
  out('FEATURE_OF_TRANSACTION').
  union(
      out('ACCOUNT'),
      out('MERCHANT'),
      in('FEATURE_OF_TRANSACTION')
  ).
  path().from('feature').
  by(
      project('type', 'value').
      by(label).
      by(valueMap('account_number', 'value', 'amount', 'created', 'name'))
  )
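
The two conditions inside the `where()` step can be sketched in Python as a simple count-and-filter. The feature names and transactions below are hypothetical, and `suspicious_features` is our own helper; its `min_transactions` parameter plays the role of the `gte(4)` threshold.

```python
from collections import Counter

# Hypothetical feature values attached to at least one account.
account_feature_values = {"ip-home-1", "ip-home-2"}

# Hypothetical (transaction id, feature used) pairs.
transaction_features = [
    ("t1", "ip-home-1"),
    ("t2", "ip-x"),
    ("t3", "ip-x"),
    ("t4", "ip-x"),
    ("t5", "ip-x"),
    ("t6", "ip-y"),
    ("t7", "ip-home-2"),
]


def suspicious_features(min_transactions=4):
    """Features tied to no account but used in at least `min_transactions` transactions."""
    usage = Counter(feature for _tx_id, feature in transaction_features)
    return sorted(
        feature
        for feature, count in usage.items()
        if count >= min_transactions and feature not in account_feature_values
    )


print(suspicious_features())
print(suspicious_features(min_transactions=1))
```

With the default threshold only `ip-x` is reported. Lowering the threshold to 1 also reports `ip-y`, which shows how the threshold trades sensitivity against noise.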
  
  

## Building a Fraud Detection Solution

A fraud graph is but one building block in a fraud detection solution. To define the queries used to detect unusual and potentially fraudulent activities, we must draw on the insights of Subject Matter Experts (SMEs). To help decide whether a particular event or constellation of data that we have found represents fraud, we require access to an expert predictive or decision-making process: this can vary, from an SME conducting further investigations into potential instances of fraud highlighted by the graph, to a machine learning (ML) model hosted by a service such as Amazon SageMaker scoring the results of a graph query.

### Formulating fraud detection graph queries

Neptune can very quickly find instances of data that match _known_ patterns of potentially fraudulent behaviour. But how do we determine what these patterns look like in the first place?

This is a task best accomplished by an SME reviewing historical data containing known instances of fraud to identify the kinds of structure that characterise fraud. The patterns identified by the SME are then encoded as graph queries (the kinds of graph queries shown throughout this notebook) that can be run repeatedly against existing and new graph data.

### Evaluating fraud detection graph query results

The results of a fraud detection graph query help us understand a transaction in the context of other data present in the system. But these query results are rarely sufficient in and of themselves to determine whether a transaction or series of transactions is indeed fraudulent. The results of the queries we've seen throughout this notebook represent _potential_ instances of fraudulent behaviour. To determine whether the data that we have matched in the graph should really be considered fraud, we require access to an expert decision-making process. In some circumstances this might entail an SME further investigating the accounts and transactions highlighted by a graph query. In other cases, the solution can use a predictive ML model to score a set of results.

### Fraud detection architectures

Amazon Neptune can be used with other Amazon Web Services to build fraud detection solutions, as shown in [this diagram](https://d1.awsstatic.com/products/Neptune/fraud_graph_neptune.ffeb117372fb1e120fc6f986126120dcb3ddde86.png). You can load data directly into Neptune using query APIs, or from relational databases using the Amazon Web Services Database Migration Service. Neptune also supports bulk loading data from Amazon S3. Neptune can then be used in conjunction with Amazon SageMaker to train predictive models.
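
As a hedged sketch of preparing data for bulk loading, the following Python snippet writes one vertex file and one edge file in memory. It assumes the Gremlin CSV load format described in the Neptune documentation, with `~id` and `~label` system columns for vertices, `~from` and `~to` for edges, and typed property columns such as `amount:Double`. The ids and values are hypothetical, and `to_csv` is our own helper, not a Neptune API. Uploading the files to Amazon S3 and starting the load job are not shown.

```python
import csv
import io


def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# One Transaction vertex with typed property columns.
vertices_csv = to_csv(
    ["~id", "~label", "amount:Double", "created:Date"],
    [["tx-1", "Transaction", 42.5, "2021-01-10T09:30:00Z"]],
)

# Edges follow the directions used in this notebook: a transaction points
# to its account, and a feature points to the transaction it was used in.
edges_csv = to_csv(
    ["~id", "~from", "~to", "~label"],
    [
        ["e-1", "tx-1", "account-1", "ACCOUNT"],
        ["e-2", "ip-1", "tx-1", "FEATURE_OF_TRANSACTION"],
    ],
)

print(vertices_csv)
print(edges_csv)
```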


## Conclusion

This notebook has shown how you can use Amazon Neptune to create a fraud graph as part of a fraud detection solution. We've used a credit card dataset with account- and transaction-centric queries to find instances of potentially fraudulent behaviour. Real-world transaction data is tangled, messy and exceptional: in order to determine what patterns of fraud look like, and whether the results of a graph query truly represent fraud, we've suggested that the data model and queries be driven by SME insights, and that query results be assessed by an expert or predictive capability such as an ML model on Amazon SageMaker.

## What's Next?

The examples in this notebook show how to develop a fraud graph data model and accompanying queries. To build a fraud detection solution that incorporates Neptune, we recommend the following resources:

  - [Getting Started with Amazon Neptune](https://pages.awscloud.com/AWS-Learning-Path-Getting-Started-with-Amazon-Neptune_2020_LP_0009-DAT.html) is a video-based learning path that shows you how to create and connect to a Neptune database, choose a data model and query language, author and tune graph queries, and integrate Neptune with other Amazon Web Services.
  - Before you begin designing your database, consult the [Amazon Web Services Reference Architectures for Using Graph Databases](https://github.com/aws-samples/aws-dbs-refarch-graph/) GitHub repo, where you can browse examples of reference deployment architectures, and learn more about building a graph data model and choosing a query language.
  - For links to documentation, blog posts, videos, and code repositories with samples and tools, see the [Amazon Neptune developer resources](https://aws.amazon.com/neptune/developer-resources/).
  - Neptune ML makes it possible to build and train useful machine learning models on large graphs in hours instead of weeks. To find out how to set up and use a graph neural network, see [Using Amazon Neptune ML for machine learning on graphs](https://docs.aws.amazon.com/neptune/latest/userguide/machine-learning.html).
  