# Analyzing healthcare FHIR data with Amazon Neptune

This Jupyter Notebook extends the walkthrough described in the blog post [Analyzing healthcare FHIR data with Amazon Neptune](https://aws.amazon.com/blogs/database/analyze-healthcare-fhir-data-with-amazon-neptune/). Complete setup steps 1-3 described in the blog post before running the queries below.

## Prerequisite: Load data from S3 into Amazon Neptune via bulk loader

Executing the cell below will open a form that you can use to submit a bulk load request to Neptune.

Adapt the values as follows:
- Source: Provide the S3 URI of your bucket (e.g. s3://example.com/). Make sure to include the trailing slash.
- Format: Select __turtle__ from the dropdown.
- Region: Correct the region if it doesn't reflect the region in which your Amazon S3 bucket and Amazon Neptune cluster were created.
- Load ARN: Provide the ARN of the IAM role you created.
- Parallelism: Select __OVERSUBSCRIBE__.

Keep the default values for the remaining properties, then submit the load request.

Depending on the size of your database instance, the load can take some time to complete.

In [None]:
%load
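
For orientation, the form values listed above correspond to the fields of a Neptune bulk loader request. The following is a minimal sketch of such a request body, assuming the loader's documented JSON parameter names; the helper `build_load_request`, the role ARN, and the region are hypothetical placeholders, and nothing is sent to Neptune.

```python
import json


def build_load_request(source, load_arn, region):
    """Return a JSON-serializable body for a bulk load request (illustrative only)."""
    if not source.startswith("s3://"):
        raise ValueError("source must be an S3 URI such as s3://example.com/")
    return {
        "source": source,                # S3 URI of the bucket, with trailing slash
        "format": "turtle",              # RDF Turtle, as selected in the form
        "iamRoleArn": load_arn,          # value of the "Load ARN" field
        "region": region,                # region of the bucket and the cluster
        "parallelism": "OVERSUBSCRIBE",  # as selected in the form
    }


# Hypothetical values for illustration only.
payload = build_load_request(
    "s3://example.com/",
    "arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    "us-east-1",
)
print(json.dumps(payload, indent=2))
```

The `%load` form remains the simplest way to submit the request from the notebook; the sketch only shows which settings the form collects.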

If you restarted your kernel before the load was completed, you can check the status of the load by executing the following two cells.

In [None]:
# Get ID of submitted request
%load_ids

In [None]:
# Get status of request by ID
%load_status 
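
If you want to interpret a status result programmatically, a sketch along the following lines may help. It assumes the status is available as a dictionary shaped like the loader's status response, with the state under `payload.overallStatus.status`; the function name and the sample response are hypothetical.

```python
# States in which the load has not finished yet.
RUNNING = {"LOAD_NOT_STARTED", "LOAD_IN_QUEUE", "LOAD_IN_PROGRESS"}


def summarize_load_status(response):
    """Classify a loader status response as 'running', 'done', or 'failed'."""
    status = response["payload"]["overallStatus"]["status"]
    if status == "LOAD_COMPLETED":
        return "done"
    if status in RUNNING:
        return "running"
    # Any other state (errors, cancellations) is treated as unsuccessful here.
    return "failed"


# Hypothetical response for illustration only.
sample = {"status": "200 OK",
          "payload": {"overallStatus": {"status": "LOAD_IN_PROGRESS"}}}
print(summarize_load_status(sample))
```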

## Basic introduction to SPARQL

SPARQL is a query language for the Resource Description Framework (RDF), which is a graph data format designed for the web. Amazon Neptune is compatible with SPARQL 1.1. This means that you can connect to a Neptune DB instance and query the graph using the query language as described in the [SPARQL 1.1 Query specification](https://www.w3.org/TR/sparql11-query/).

A query in SPARQL consists of a SELECT clause to specify the variables to return and a WHERE clause to specify which data to match in the graph. If you are unfamiliar with SPARQL queries, see [Writing Simple Queries](https://www.w3.org/TR/sparql11-query/#WritingSimpleQueries) in the SPARQL 1.1 Query Language specification.

The following query retrieves ten arbitrary triples from your graph. Triples are statements consisting of a subject, a predicate, and an object.

In [None]:
%%sparql --expand-all

SELECT *
WHERE
{ ?s ?p ?o . }
LIMIT 10

You can narrow down which triples to retrieve by specifying the subject, predicate, and/or object. The example below keeps the subject as a variable and fixes the other two positions: it retrieves every subject that is related to the object http://hl7.org/fhir/QuestionnaireResponse via the predicate http://www.w3.org/1999/02/22-rdf-syntax-ns#type. In other words, the pattern matches subjects that are of the type QuestionnaireResponse. Instead of returning all values, the query returns only the subjects, in this case ten questionnaire response IDs.

In [None]:
%%sparql --expand-all

SELECT ?questionnaireResponse
WHERE
{ 
 ?questionnaireResponse <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://hl7.org/fhir/QuestionnaireResponse> .
}
LIMIT 10
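
To make the idea of triple patterns concrete, here is a small, self-contained Python sketch that is not tied to Neptune. It treats `None` as a variable and fixed strings as constants, which mirrors how the two queries above differ. The sample triples and the `match` helper are hypothetical and only serve as an illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QR_CLASS = "http://hl7.org/fhir/QuestionnaireResponse"

# Hypothetical sample graph: (subject, predicate, object) tuples.
TRIPLES = [
    ("QuestionnaireResponse/a", RDF_TYPE, QR_CLASS),
    ("QuestionnaireResponse/b", RDF_TYPE, QR_CLASS),
    ("Patient/p1", RDF_TYPE, "http://hl7.org/fhir/Patient"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts like a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Like "?s ?p ?o": every position is a variable, so every triple matches.
all_triples = match(TRIPLES)

# Like "?questionnaireResponse <...#type> <.../QuestionnaireResponse>":
# predicate and object are fixed, and only the subjects are returned.
responses = [s for s, _, _ in match(TRIPLES, p=RDF_TYPE, o=QR_CLASS)]
print(responses)
```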

For better readability, we introduce two prefixes, __fhir__ and __rdf__, which abbreviate the full IRIs used in the WHERE clause.

In [None]:
%%sparql --expand-all

PREFIX fhir: <http://hl7.org/fhir/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?questionnaireResponse
WHERE
{ 
 ?questionnaireResponse rdf:type fhir:QuestionnaireResponse .
}
LIMIT 10
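
A prefix is only an abbreviation: the query engine replaces the prefix with its namespace IRI before matching. The following sketch shows that expansion in plain Python. The `expand` helper is hypothetical, and the namespace IRIs are derived from the full IRIs quoted earlier in this section.

```python
PREFIXES = {
    "fhir": "http://hl7.org/fhir/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'rdf:type' into a full IRI."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


print(expand("rdf:type"))
print(expand("fhir:QuestionnaireResponse"))
```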

Instead of SELECT, you can use CONSTRUCT to return a new RDF graph. The template in the CONSTRUCT clause specifies the shape of this graph. You can also use slashes to chain multiple predicates into a path that the query follows.
The query below constructs a new graph that maps questionnaire responses to patients.

Navigate to the __Graph__ tab to view the graph visualization.

In [None]:
%%sparql --expand-all

PREFIX fhir: <http://hl7.org/fhir/>
PREFIX qr: 

CONSTRUCT { 
 ?questionnaireResponse fhir:value ?patient .
 ?questionnaireResponse a fhir:QuestionnaireResponse .
 ?patient a fhir:Patient .
}
WHERE { 
 ?questionnaireResponse qr:subject/fhir:Reference.reference/fhir:value ?patient .
}
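
The slash notation in the WHERE clause is a sequence path: each predicate is followed in turn, starting from the subject. As a rough illustration outside of SPARQL, the sketch below walks such a path over a list of tuples. The sample triples, node names, and the `follow_path` helper are hypothetical, and the predicates are kept as plain strings instead of full IRIs.

```python
# Hypothetical sample graph: response -> reference node -> literal node -> value.
TRIPLES = [
    ("qr1", "qr:subject", "ref1"),
    ("ref1", "fhir:Reference.reference", "lit1"),
    ("lit1", "fhir:value", "Patient/p1"),
    ("qr1", "qr:item", "item1"),
]


def follow_path(triples, start, path):
    """Follow a sequence of predicates from start and return the reachable nodes."""
    frontier = {start}
    for predicate in path:
        frontier = {o for (s, p, o) in triples if s in frontier and p == predicate}
    return frontier


patients = follow_path(
    TRIPLES, "qr1", ["qr:subject", "fhir:Reference.reference", "fhir:value"]
)
print(patients)
```

Without the path syntax, the same match would need one triple pattern and one intermediate variable per step.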

## Sample Queries

### 1. Identify patients who work(ed) in the same industry

The first query matches questionnaire responses that give the same answer to question 4.2 “Employer industry” and returns the corresponding patients. As a result, you can quickly identify clusters of patients who work(ed) in the same industry. The visualization makes it easy to see which industries were named often and which only rarely. Pharma & Health, for example, was named by a large number of patients.

__Hint:__ Drag the nodes in the graph visualization to separate the clusters from each other for better visibility.

In [None]:
%%sparql --expand-all

PREFIX fhir: <http://hl7.org/fhir/>
PREFIX qr: 

CONSTRUCT {
 ?questionnaireResponse fhir:value ?patient ;
 fhir:value ?industryAnswer .
 
 ?questionnaireResponse a fhir:QuestionnaireResponse .
 ?patient a fhir:Patient .
}
WHERE {
 ?questionnaireResponse qr:subject/fhir:Reference.reference/fhir:value ?patient ;
 qr:item/qr:item.item ?item4_2 .
 ?item4_2 qr:item.item.answer/qr:item.item.answer.valueString/fhir:value ?industryAnswer ;
 qr:item.item.linkId/fhir:value "4.2" .
}
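
The clusters in the visualization amount to grouping patients by their industry answer. If you prefer a tabular view, a sketch like the following could group `(patient, industry)` pairs in Python. The rows shown are made-up examples rather than results from the dataset, and `cluster_by_industry` is a hypothetical helper.

```python
from collections import defaultdict


def cluster_by_industry(rows):
    """Group patient IDs by industry answer."""
    clusters = defaultdict(set)
    for patient, industry in rows:
        clusters[industry].add(patient)
    return dict(clusters)


# Hypothetical (patient, industry) pairs for illustration only.
rows = [
    ("Patient/p1", "Pharma & Health"),
    ("Patient/p2", "Pharma & Health"),
    ("Patient/p3", "Construction"),
]
clusters = cluster_by_industry(rows)

# List industries from most to least frequently named.
for industry, patients in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(industry, len(patients))
```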


### 2. Identify industries with common hazards

This query matches the answers to questions 4.2 “Employer industry” and 4.3 “Hazards in Workplace”. Answers stating no hazards are filtered out. This gives an overview of hazards that are more common in some industries than in others.

Given the density of nodes, you can identify two general clusters of industries related to more and less threatening hazards.

The first cluster contains industries related to safety, biological, chemical, and physical hazards. The construction industry, for example, is closely related to safety hazards. The second cluster contains industries related to ergonomic and workload hazards. The service & crafts industry, for example, is linked to ergonomic hazards. Some questionnaire responses link industries with hazards from the other cluster, but most patients reported hazards within one cluster. You can use this information to dive deeper into these cases and understand where the difference comes from.

__Hint:__ Drag the nodes in the graph visualization to separate the clusters from each other for better visibility.

In [None]:
%%sparql --expand-all

PREFIX fhir: <http://hl7.org/fhir/>
PREFIX qr: 


CONSTRUCT {
 ?parentItem4 fhir:value ?industryAnswer ;
 fhir:value ?hazardAnswer .

 ?parentItem4 a qr:item.item .
 ?industryAnswer a fhir:value .
}
WHERE {
 ?industryAnswer ^fhir:value/^qr:item.item.answer.valueString/^qr:item.item.answer ?item4_2 .
 ?item4_2 qr:item.item.linkId/fhir:value "4.2" ;
 ^qr:item.item ?parentItem4 .
 ?parentItem4 qr:item.item ?item4_3 .
 ?item4_3 qr:item.item.linkId/fhir:value "4.3" ;
 qr:item.item.answer/qr:item.item.answer.valueString/fhir:value ?hazardAnswer .
 
 FILTER('None' != ?hazardAnswer)
}
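
To quantify how strongly an industry is linked to a hazard, you could count how often each pair occurs. The sketch below does this for `(industry, hazard)` pairs and drops the 'None' answers, as the FILTER in the query does. The rows and the exact answer strings are hypothetical, and `hazard_counts` is an illustrative helper.

```python
from collections import Counter


def hazard_counts(rows):
    """Count (industry, hazard) pairs, skipping answers that state no hazard."""
    return Counter((industry, hazard) for industry, hazard in rows if hazard != "None")


# Hypothetical (industry, hazard) pairs for illustration only.
rows = [
    ("Construction", "Safety"),
    ("Construction", "Safety"),
    ("Construction", "None"),
    ("Service & Crafts", "Ergonomic"),
]
counts = hazard_counts(rows)
for (industry, hazard), n in counts.most_common():
    print(industry, hazard, n)
```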

### 3. Get questionnaires with similar answers for a question group compared to a single questionnaire

This sample query compares patients' answers in question section 1 “Drinking and smoking behavior”, which contains five questions:

1. How many liters of beer do you consume in a week? 
2. How many liters of wine do you consume in a month? 
3. How many years have you been smoking?
4. How many cigarettes do you currently smoke per day? 
5. How many cigars do you currently smoke per week?

The result of the query is a list of questionnaire responses that match a particular questionnaire response (QuestionnaireResponse/92d290e2-26a6-4474-9085-71f3b146dfd5) in at least 4 of 5 answers. In this example, the questionnaire response that the others are matched against is also included in the result list.

In [None]:
%%sparql

PREFIX fhir: <http://hl7.org/fhir/>
PREFIX qr: 

SELECT ?similarQR (count(?sameAnswerValue) as ?sameAnswerCount) 
WHERE {
 qr:item ?parentItem1_a .
 ?parentItem1_a qr:item.linkId/fhir:value "1" ;
 qr:item.item ?subItem_a .
 ?subItem_a qr:item.item.answer/qr:item.item.answer.valueInteger/fhir:value ?sameAnswerValue ;
 qr:item.item.text/fhir:value ?question .
 
 ?similarQR qr:item ?parentItem1_b .
 ?parentItem1_b qr:item.linkId/fhir:value "1" ;
 qr:item.item ?subItem_b .
 ?subItem_b qr:item.item.answer/qr:item.item.answer.valueInteger/fhir:value ?sameAnswerValue ;
 qr:item.item.text/fhir:value ?question .
}
GROUP BY ?similarQR
HAVING (?sameAnswerCount > 3) 
ORDER BY DESC(?sameAnswerCount)
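
The GROUP BY and HAVING logic can be read as follows: for every questionnaire response, count the questions whose answer equals the reference answer, and keep the responses with more than three matches. The Python sketch below restates that logic on made-up data. The question labels, IDs, and the `similar_responses` helper are hypothetical.

```python
def similar_responses(reference, responses, min_matches=4):
    """Return (id, match_count) pairs with at least min_matches equal answers."""
    results = []
    for qr_id, answers in responses.items():
        same = sum(1 for question, value in reference.items()
                   if answers.get(question) == value)
        if same >= min_matches:
            results.append((qr_id, same))
    # Highest match count first, like ORDER BY DESC(?sameAnswerCount).
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical integer answers to the five questions of section 1.
reference = {"beer": 2, "wine": 1, "smoking_years": 10, "cigarettes": 5, "cigars": 0}
responses = {
    "qr-ref": dict(reference),            # the reference response itself
    "qr-a": {**reference, "cigars": 3},   # differs in one answer
    "qr-b": {"beer": 2, "wine": 1, "smoking_years": 0, "cigarettes": 0, "cigars": 1},
}
matches = similar_responses(reference, responses)
print(matches)
```

As in the query, the reference response matches itself in all five answers and therefore appears in the result.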

## Conclusion

This notebook showed you how easy it is to load data into a graph. You also issued three different queries to illustrate how you can generate insights from FHIR data.

To dive deeper into the topic, see the [Amazon Neptune developer resources](https://aws.amazon.com/neptune/developer-resources/) for documentation links, other blog posts, videos, and sample code repositories.