# Edgar Holdings 

The examples in this notebook demonstrate using the GremlinPython library to connect to and work with a Neptune instance. Using a Jupyter notebook in this way provides a nice way to interact with your Neptune graph database in a familiar and instantly productive environment.

## Connect to the Neptune Database that holds the loaded Edgar Data

When the SageMaker notebook instance was created, the appropriate Python libraries for working with a TinkerPop-enabled graph were installed. We now need to `import` some classes from those libraries before connecting to our Neptune instance, loading some sample data, and running queries.

The cell below installs the packages this notebook needs. Run it once to configure the environment.

In [None]:
!pip install --upgrade pip
!pip install futures 
!pip install gremlinpython
!pip install SPARQLWrapper
!pip install matplotlib
!pip install numpy 
!pip install pandas 
!pip install networkx 

In [None]:
%run '../util/neptune.py'

## Establish access to our Neptune instance

Before we can work with our graph we need to establish a connection to it. This is done using the `DriverRemoteConnection` capability as defined by Apache TinkerPop and supported by GremlinPython. The `neptune.py` helper module facilitates creating this connection.

Once this cell has been run we will be able to use the variable `g` to refer to our graph in Gremlin queries in subsequent cells. By default Neptune uses port 8182 and that is what we connect to below. When you configure your own Neptune instance you can choose a different endpoint and port number by specifying the `neptune_endpoint` and `neptune_port` parameters to the `graphTraversal()` method.

In [None]:
endpoint="neptuneuser.cluster-carpeooi4ov5.us-east-1.neptune.amazonaws.com"
port=8182
my_region='us-east-1'
g = neptune.graphTraversal(neptune_endpoint=endpoint,neptune_port=port)
print("g = {0} ".format(g))
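
The internals of `neptune.py` are not shown in this notebook. As a rough sketch of what a helper such as `graphTraversal()` typically does with GremlinPython, the code below builds the Gremlin WebSocket URL for an endpoint and port and opens a `DriverRemoteConnection` to it. The function names `gremlin_url` and `connect` are hypothetical, and the `wss://<endpoint>:<port>/gremlin` URL shape assumes a TLS-enabled Neptune endpoint.

```python
def gremlin_url(endpoint, port=8182, secure=True):
    """Build the WebSocket URL of a Gremlin server (hypothetical helper)."""
    scheme = "wss" if secure else "ws"
    return "{0}://{1}:{2}/gremlin".format(scheme, endpoint, port)


def connect(endpoint, port=8182):
    """Return (g, connection) for a remote graph; assumes gremlinpython is installed."""
    # Imported here so the URL helper can be used without gremlinpython.
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    connection = DriverRemoteConnection(gremlin_url(endpoint, port), "g")
    g = traversal().withRemote(connection)
    return g, connection


# Placeholder host name, for illustration only.
print(gremlin_url("my-cluster.example.com"))
```

When connecting this way, call `connection.close()` once you have finished querying so the WebSocket is released.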

## Data for Analysis 
If you are using a Neptune graph that was loaded by another process, skip the cell below, because it clears the graph.
If the graph is empty, you can load data from a single quarter. The following code will:
 - clear the graph of any data
 - copy the sample data to S3
 - bulk load the data

For this to work you need to have created an S3 bucket and updated the variable `s3bucket` with its name. You should also define a `key` to be used as the prefix for the uploaded files, and set `s3role` to the ARN of an IAM role that lets Neptune load from S3 (NEPTUNE_LOAD_FROM_S3_ROLE_ARN), as explained in the Neptune documentation: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-IAM.html


In [None]:
import boto3
s3 = boto3.resource('s3')

s3bucket="s3stockmktdata"
key="neptune"
s3role="arn:aws:iam::983739021977:role/NeptuneLoadFromS3"

fileslist= [ "security.csv" ,"securityholdr.csv","srelations.csv" ]
localhome="/home/ec2-user/SageMaker/amazon-neptune-samples/neptune-sagemaker/notebooks/edgar"
bulkloaddir="s3://{0}/{1}/".format(s3bucket,key)

# Remove all existing data from the graph before loading.
neptune.clear(neptune_endpoint=endpoint,neptune_port=port)

for filename in fileslist:
    # Local path of the CSV file and the S3 object key it is uploaded to.
    localfile="{0}/{1}".format(localhome,filename)
    rmtfilename="{0}/{1}".format(key,filename)
    bulkloaddir="s3://{0}/{1}/{2}".format(s3bucket,key,filename)
    print("Copy local file {0} \n to s3 filename={1}".format(localfile,rmtfilename))
    print("Neptune Load of {0} ".format(bulkloaddir))
    # Upload first, then ask Neptune to bulk load the uploaded object.
    s3.Bucket(s3bucket).upload_file(localfile,rmtfilename)
    neptune.bulkLoad(bulkloaddir,format='csv', interval=5,role=s3role,region=my_region,neptune_endpoint=endpoint,neptune_port=port)
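
The arguments passed to `neptune.bulkLoad()` correspond closely to the fields of a Neptune bulk loader request, which is an HTTP POST to the `/loader` path of the cluster endpoint. The sketch below only builds the URL and JSON body. `loader_request` is a hypothetical name, the bucket and role values are placeholders, and how `neptune.py` actually issues the request is not shown in this notebook.

```python
import json


def loader_request(endpoint, port, source, role_arn, region, fmt="csv"):
    """Return (url, body) for a bulk load request (hypothetical helper)."""
    url = "https://{0}:{1}/loader".format(endpoint, port)
    body = {
        "source": source,        # S3 URI of a file or prefix
        "format": fmt,           # "csv" for Gremlin CSV files
        "iamRoleArn": role_arn,  # role Neptune assumes to read from S3
        "region": region,        # region of the S3 bucket
        "failOnError": "FALSE",  # skip bad records instead of stopping
    }
    return url, json.dumps(body)


url, body = loader_request(
    "my-cluster.example.com", 8182,
    "s3://my-bucket/neptune/security.csv",
    "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # placeholder ARN
    "us-east-1",
)
print(url)
print(body)
```

Posting this body with a `Content-Type: application/json` header returns a load id, and the progress of that load can be polled with a GET on `/loader/<loadId>`.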

## Let's find out a bit about the graph

Let's start off with a simple query just to make sure our connection to Neptune is working. The queries below look at all of the vertices and edges in the graph and create two maps that count them by label. As we are using the Edgar holdings data set, the values returned relate to securities, their holders, and the relationships between them.

In [None]:
vertices = g.V().groupCount().by(T.label).toList()
edges = g.E().groupCount().by(T.label).toList()
print("Vertices ={0}".format(vertices))
print("Edges = {0} ".format(edges))
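
To make the shape of the result concrete: `groupCount().by(T.label)` produces one map from label to count, which `toList()` wraps in a single-element list. The plain-Python sketch below computes the same kind of summary locally. The labels in `sample_labels` are made up for illustration and are not taken from the Edgar data.

```python
from collections import Counter


def group_count(labels):
    """Count how many times each label occurs, like groupCount().by(T.label)."""
    return dict(Counter(labels))


# Hypothetical vertex labels, for illustration only.
sample_labels = ["holder", "security", "security", "holder", "security"]
print(group_count(sample_labels))
```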

Now let's take a look at some of the relationships. Holders are the vertices that are in possession of a security.
The edges carry the details from the Edgar documents that link holders to securities.

In [None]:
securitypath = g.V().toList()

print(securitypath)

In [None]:
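from gremlin_python.process.graph_traversal import __
# by() only modulates a preceding step; project() is assumed to be the intended one. toList() runs the query.
countedge = g.V().hasLabel('13F-HR').project('edge_count','neighbor_labels').by(__.outE().count()).by(__.out().groupCount().by(T.label)).toList()
print(countedge)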
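
The query above aims at a per-vertex summary: for each `13F-HR` vertex, how many outgoing edges it has and which labels its neighbors carry. The local sketch below computes that summary over a tiny in-memory edge list so the expected result shape is easy to see. The vertex ids, the labels other than `13F-HR`, and the function name `summarize_out_edges` are all hypothetical.

```python
from collections import Counter


def summarize_out_edges(vertex_labels, edges, source_label):
    """For each vertex with source_label, count out-edges and neighbor labels."""
    summary = {}
    for vertex, label in vertex_labels.items():
        if label != source_label:
            continue
        # Targets of every edge that starts at this vertex.
        targets = [dst for src, dst in edges if src == vertex]
        summary[vertex] = {
            "edge_count": len(targets),
            "neighbor_labels": dict(Counter(vertex_labels[t] for t in targets)),
        }
    return summary


# Hypothetical toy graph: two 13F-HR vertices pointing at securities.
vertex_labels = {"f1": "13F-HR", "f2": "13F-HR", "s1": "security", "s2": "security"}
edges = [("f1", "s1"), ("f1", "s2"), ("f2", "s1")]
print(summarize_out_edges(vertex_labels, edges, "13F-HR"))
```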