{ "cells": [ { "cell_type": "markdown", "id": "59cc364b-59b9-42e8-9cb6-431f6cfc9bb8", "metadata": {}, "source": [ "# Sample Julia Notebook - downloading and using data from S3\n", "[](https://studiolab.sagemaker.aws/import/github/aws/studio-lab-examples/blob/main/use-julia-in-studio-lab/2-sample-julia-nb.ipynb)" ] }, { "cell_type": "markdown", "id": "4e896402-784b-4566-8c14-e7bc8b615007", "metadata": {}, "source": [ "## Installing packages\n", "\n", "As in the Installing Julia notebook, we use `Pkg.add` to install the packages used in this notebook.\n", "\n", " - [Gadfly](http://gadflyjl.org/stable/): grammar of graphics plotting, similar to ggplot in R\n", " - [AWS](https://github.com/JuliaCloud/AWS.jl): third-party SDK for performing tasks in AWS\n", " - [AWSS3](https://github.com/JuliaCloud/AWSS3.jl): high-level third-party interface to Amazon Simple Storage Service\n", " - [DataFrames](https://dataframes.juliadata.org/stable/): data structure for working with tabular data, similar to pandas in Python\n", " - [Query](http://www.queryverse.org/Query.jl/stable/): convenience functions for working with DataFrames and other queryable data sources, similar to dplyr in R\n", " - [JSON](https://github.com/JuliaIO/JSON.jl): library for parsing JSON data\n", " - [TranscodingStreams](https://juliaio.github.io/TranscodingStreams.jl/latest/): framework for working with encoded data\n", " - [CodecZlib](https://github.com/JuliaIO/CodecZlib.jl): zlib compression codec for use with TranscodingStreams; allows us to work with gzipped data\n", " \n", "After installing the packages, we'll load each one as it is needed in this notebook." ] }, { "cell_type": "code", "execution_count": 1, "id": "b4215fe4-00cb-41d1-b7f7-1f03f247847b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m\u001b[1m Updating\u001b[22m\u001b[39m registry at `~/.julia/registries/General.toml`\n", "\u001b[32m\u001b[1m 
Resolving\u001b[22m\u001b[39m package versions...\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Crayons ───────────────────── v4.0.4\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Showoff ───────────────────── v1.0.3\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Adapt ─────────────────────── v3.3.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m QueryOperators ────────────── v0.9.3\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IrrationalConstants ───────── v0.1.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m EzXML ─────────────────────── v1.1.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m ColorTypes ────────────────── v0.11.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m OffsetArrays ──────────────── v1.10.8\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Rmath ─────────────────────── v0.7.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m XML2_jll ──────────────────── v2.9.12+0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m FFTW ──────────────────────── v1.4.5\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m StatsFuns ─────────────────── v0.9.14\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m XMLDict ───────────────────── v0.4.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m SodiumSeal ────────────────── v0.1.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IterTools ─────────────────── v1.4.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m TableTraits ───────────────── v1.0.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Mocking ───────────────────── v0.7.3\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IndirectArrays ────────────── v1.0.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Tables ────────────────────── v1.6.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m SpecialFunctions ──────────── v2.0.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m CategoricalArrays ─────────── v0.10.2\n", 
"\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Contour ───────────────────── v0.5.7\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m PrettyTables ──────────────── v1.3.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Hexagons ──────────────────── v0.2.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DataAPI ───────────────────── v1.9.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m PDMats ────────────────────── v0.11.5\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m MKL_jll ───────────────────── v2021.1.1+2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Measures ──────────────────── v0.3.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m PooledArrays ──────────────── v1.4.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m FixedPointNumbers ─────────── v0.8.4\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m StaticArrays ──────────────── v1.2.13\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m AbstractFFTs ──────────────── v1.0.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Retry ─────────────────────── v0.4.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Grisu ─────────────────────── v1.0.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Libiconv_jll ──────────────── v1.16.1+1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Formatting ────────────────── v0.4.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IteratorInterfaceExtensions ─ v1.0.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IterableTables ────────────── v1.0.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IniFile ───────────────────── v0.5.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Gadfly ────────────────────── v1.3.4\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m URIs ──────────────────────── v1.3.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DataValueInterfaces ───────── v1.0.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m 
OrderedCollections ────────── v1.4.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m KernelDensity ─────────────── v0.6.3\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m AWSS3 ─────────────────────── v0.9.3\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m TranscodingStreams ────────── v0.9.6\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m InvertedIndices ───────────── v1.1.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m ChainRulesCore ────────────── v1.11.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m HTTP ──────────────────────── v0.9.17\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m FilePathsBase ─────────────── v0.9.17\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Loess ─────────────────────── v0.5.4\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m FillArrays ────────────────── v0.12.7\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Rmath_jll ─────────────────── v0.3.0+0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Reexport ──────────────────── v1.2.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m QuadGK ────────────────────── v2.4.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DataFrames ────────────────── v1.3.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Media ─────────────────────── v0.5.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Ratios ────────────────────── v0.4.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m AxisAlgorithms ────────────── v1.0.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m LogExpFunctions ───────────── v0.3.6\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Requires ──────────────────── v1.2.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m IntelOpenMP_jll ───────────── v2018.0.3+2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DataStructures ────────────── v0.18.11\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Colors ────────────────────── v0.12.8\n", 
"\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DataValues ────────────────── v0.4.13\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m SymDict ───────────────────── v0.3.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Query ─────────────────────── v1.0.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Distributions ─────────────── v0.25.37\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Juno ──────────────────────── v0.8.4\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m StatsAPI ──────────────────── v1.1.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m ExprTools ─────────────────── v0.1.6\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m MacroTools ────────────────── v0.5.9\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m FFTW_jll ──────────────────── v3.3.10+0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Interpolations ────────────── v0.13.5\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m OpenSpecFun_jll ───────────── v0.5.5+0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Compat ────────────────────── v3.41.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Distances ─────────────────── v0.10.7\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m InverseFunctions ──────────── v0.1.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m TableTraitsUtils ──────────── v1.0.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m CoupledFields ─────────────── v0.2.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Compose ───────────────────── v0.9.3\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m CodecZlib ─────────────────── v0.7.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m SortingAlgorithms ─────────── v1.0.1\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m WoodburyMatrices ──────────── v0.5.5\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m Missings ──────────────────── v1.0.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m 
ChangesOfVariables ────────── v0.1.2\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m StatsBase ─────────────────── v0.33.13\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m GitHub ────────────────────── v5.7.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m TableShowUtils ────────────── v0.2.5\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DensityInterface ──────────── v0.4.0\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m DocStringExtensions ───────── v0.8.6\n", "\u001b[32m\u001b[1m Installed\u001b[22m\u001b[39m AWS ───────────────────────── v1.74.0\n", "\u001b[32m\u001b[1m Updating\u001b[22m\u001b[39m `~/.julia/environments/v1.7/Project.toml`\n", " \u001b[90m [fbe9abb3] \u001b[39m\u001b[92m+ AWS v1.74.0\u001b[39m\n", " \u001b[90m [1c724243] \u001b[39m\u001b[92m+ AWSS3 v0.9.3\u001b[39m\n", " \u001b[90m [944b1d66] \u001b[39m\u001b[92m+ CodecZlib v0.7.0\u001b[39m\n", " \u001b[90m [a93c6f00] \u001b[39m\u001b[92m+ DataFrames v1.3.1\u001b[39m\n", " \u001b[90m [c91e804a] \u001b[39m\u001b[92m+ Gadfly v1.3.4\u001b[39m\n", " \u001b[90m [682c06a0] \u001b[39m\u001b[92m+ JSON v0.21.2\u001b[39m\n", " \u001b[90m [1a8c2f83] \u001b[39m\u001b[92m+ Query v1.0.0\u001b[39m\n", " \u001b[90m [3bb67fe8] \u001b[39m\u001b[92m+ TranscodingStreams v0.9.6\u001b[39m\n", "\u001b[32m\u001b[1m Updating\u001b[22m\u001b[39m `~/.julia/environments/v1.7/Manifest.toml`\n", " \u001b[90m [fbe9abb3] \u001b[39m\u001b[92m+ AWS v1.74.0\u001b[39m\n", " \u001b[90m [1c724243] \u001b[39m\u001b[92m+ AWSS3 v0.9.3\u001b[39m\n", " \u001b[90m [621f4979] \u001b[39m\u001b[92m+ AbstractFFTs v1.0.1\u001b[39m\n", " \u001b[90m [79e6a3ab] \u001b[39m\u001b[92m+ Adapt v3.3.1\u001b[39m\n", " \u001b[90m [13072b0f] \u001b[39m\u001b[92m+ AxisAlgorithms v1.0.1\u001b[39m\n", " \u001b[90m [324d7699] \u001b[39m\u001b[92m+ CategoricalArrays v0.10.2\u001b[39m\n", " \u001b[90m [d360d2e6] \u001b[39m\u001b[92m+ ChainRulesCore v1.11.2\u001b[39m\n", " \u001b[90m [9e997f8a] \u001b[39m\u001b[92m+ 
ChangesOfVariables v0.1.2\u001b[39m\n", " \u001b[90m [944b1d66] \u001b[39m\u001b[92m+ CodecZlib v0.7.0\u001b[39m\n", " \u001b[90m [3da002f7] \u001b[39m\u001b[92m+ ColorTypes v0.11.0\u001b[39m\n", " \u001b[90m [5ae59095] \u001b[39m\u001b[92m+ Colors v0.12.8\u001b[39m\n", " \u001b[90m [34da2185] \u001b[39m\u001b[92m+ Compat v3.41.0\u001b[39m\n", " \u001b[90m [a81c6b42] \u001b[39m\u001b[92m+ Compose v0.9.3\u001b[39m\n", " \u001b[90m [d38c429a] \u001b[39m\u001b[92m+ Contour v0.5.7\u001b[39m\n", " \u001b[90m [7ad07ef1] \u001b[39m\u001b[92m+ CoupledFields v0.2.0\u001b[39m\n", " \u001b[90m [a8cc5b0e] \u001b[39m\u001b[92m+ Crayons v4.0.4\u001b[39m\n", " \u001b[90m [9a962f9c] \u001b[39m\u001b[92m+ DataAPI v1.9.0\u001b[39m\n", " \u001b[90m [a93c6f00] \u001b[39m\u001b[92m+ DataFrames v1.3.1\u001b[39m\n", " \u001b[90m [864edb3b] \u001b[39m\u001b[92m+ DataStructures v0.18.11\u001b[39m\n", " \u001b[90m [e2d170a0] \u001b[39m\u001b[92m+ DataValueInterfaces v1.0.0\u001b[39m\n", " \u001b[90m [e7dc6d0d] \u001b[39m\u001b[92m+ DataValues v0.4.13\u001b[39m\n", " \u001b[90m [b429d917] \u001b[39m\u001b[92m+ DensityInterface v0.4.0\u001b[39m\n", " \u001b[90m [b4f34e82] \u001b[39m\u001b[92m+ Distances v0.10.7\u001b[39m\n", " \u001b[90m [31c24e10] \u001b[39m\u001b[92m+ Distributions v0.25.37\u001b[39m\n", " \u001b[90m [ffbed154] \u001b[39m\u001b[92m+ DocStringExtensions v0.8.6\u001b[39m\n", " \u001b[90m [e2ba6199] \u001b[39m\u001b[92m+ ExprTools v0.1.6\u001b[39m\n", " \u001b[90m [8f5d6c58] \u001b[39m\u001b[92m+ EzXML v1.1.0\u001b[39m\n", " \u001b[90m [7a1cc6ca] \u001b[39m\u001b[92m+ FFTW v1.4.5\u001b[39m\n", " \u001b[90m [48062228] \u001b[39m\u001b[92m+ FilePathsBase v0.9.17\u001b[39m\n", " \u001b[90m [1a297f60] \u001b[39m\u001b[92m+ FillArrays v0.12.7\u001b[39m\n", " \u001b[90m [53c48c17] \u001b[39m\u001b[92m+ FixedPointNumbers v0.8.4\u001b[39m\n", " \u001b[90m [59287772] \u001b[39m\u001b[92m+ Formatting v0.4.2\u001b[39m\n", " \u001b[90m [c91e804a] \u001b[39m\u001b[92m+ Gadfly 
v1.3.4\u001b[39m\n", " \u001b[90m [bc5e4493] \u001b[39m\u001b[92m+ GitHub v5.7.0\u001b[39m\n", " \u001b[90m [42e2da0e] \u001b[39m\u001b[92m+ Grisu v1.0.2\u001b[39m\n", " \u001b[90m [cd3eb016] \u001b[39m\u001b[92m+ HTTP v0.9.17\u001b[39m\n", " \u001b[90m [a1b4810d] \u001b[39m\u001b[92m+ Hexagons v0.2.0\u001b[39m\n", " \u001b[90m [9b13fd28] \u001b[39m\u001b[92m+ IndirectArrays v1.0.0\u001b[39m\n", " \u001b[90m [83e8ac13] \u001b[39m\u001b[92m+ IniFile v0.5.0\u001b[39m\n", " \u001b[90m [a98d9a8b] \u001b[39m\u001b[92m+ Interpolations v0.13.5\u001b[39m\n", " \u001b[90m [3587e190] \u001b[39m\u001b[92m+ InverseFunctions v0.1.2\u001b[39m\n", " \u001b[90m [41ab1584] \u001b[39m\u001b[92m+ InvertedIndices v1.1.0\u001b[39m\n", " \u001b[90m [92d709cd] \u001b[39m\u001b[92m+ IrrationalConstants v0.1.1\u001b[39m\n", " \u001b[90m [c8e1da08] \u001b[39m\u001b[92m+ IterTools v1.4.0\u001b[39m\n", " \u001b[90m [1c8ee90f] \u001b[39m\u001b[92m+ IterableTables v1.0.0\u001b[39m\n", " \u001b[90m [82899510] \u001b[39m\u001b[92m+ IteratorInterfaceExtensions v1.0.0\u001b[39m\n", " \u001b[90m [e5e0dc1b] \u001b[39m\u001b[92m+ Juno v0.8.4\u001b[39m\n", " \u001b[90m [5ab0869b] \u001b[39m\u001b[92m+ KernelDensity v0.6.3\u001b[39m\n", " \u001b[90m [4345ca2d] \u001b[39m\u001b[92m+ Loess v0.5.4\u001b[39m\n", " \u001b[90m [2ab3a3ac] \u001b[39m\u001b[92m+ LogExpFunctions v0.3.6\u001b[39m\n", " \u001b[90m [1914dd2f] \u001b[39m\u001b[92m+ MacroTools v0.5.9\u001b[39m\n", " \u001b[90m [442fdcdd] \u001b[39m\u001b[92m+ Measures v0.3.1\u001b[39m\n", " \u001b[90m [e89f7d12] \u001b[39m\u001b[92m+ Media v0.5.0\u001b[39m\n", " \u001b[90m [e1d29d7a] \u001b[39m\u001b[92m+ Missings v1.0.2\u001b[39m\n", " \u001b[90m [78c3b35d] \u001b[39m\u001b[92m+ Mocking v0.7.3\u001b[39m\n", " \u001b[90m [6fe1bfb0] \u001b[39m\u001b[92m+ OffsetArrays v1.10.8\u001b[39m\n", " \u001b[90m [bac558e1] \u001b[39m\u001b[92m+ OrderedCollections v1.4.1\u001b[39m\n", " \u001b[90m [90014a1f] \u001b[39m\u001b[92m+ PDMats v0.11.5\u001b[39m\n", " 
\u001b[90m [2dfb63ee] \u001b[39m\u001b[92m+ PooledArrays v1.4.0\u001b[39m\n", " \u001b[90m [08abe8d2] \u001b[39m\u001b[92m+ PrettyTables v1.3.1\u001b[39m\n", " \u001b[90m [1fd47b50] \u001b[39m\u001b[92m+ QuadGK v2.4.2\u001b[39m\n", " \u001b[90m [1a8c2f83] \u001b[39m\u001b[92m+ Query v1.0.0\u001b[39m\n", " \u001b[90m [2aef5ad7] \u001b[39m\u001b[92m+ QueryOperators v0.9.3\u001b[39m\n", " \u001b[90m [c84ed2f1] \u001b[39m\u001b[92m+ Ratios v0.4.2\u001b[39m\n", " \u001b[90m [189a3867] \u001b[39m\u001b[92m+ Reexport v1.2.2\u001b[39m\n", " \u001b[90m [ae029012] \u001b[39m\u001b[92m+ Requires v1.2.0\u001b[39m\n", " \u001b[90m [20febd7b] \u001b[39m\u001b[92m+ Retry v0.4.1\u001b[39m\n", " \u001b[90m [79098fc4] \u001b[39m\u001b[92m+ Rmath v0.7.0\u001b[39m\n", " \u001b[90m [992d4aef] \u001b[39m\u001b[92m+ Showoff v1.0.3\u001b[39m\n", " \u001b[90m [2133526b] \u001b[39m\u001b[92m+ SodiumSeal v0.1.1\u001b[39m\n", " \u001b[90m [a2af1166] \u001b[39m\u001b[92m+ SortingAlgorithms v1.0.1\u001b[39m\n", " \u001b[90m [276daf66] \u001b[39m\u001b[92m+ SpecialFunctions v2.0.0\u001b[39m\n", " \u001b[90m [90137ffa] \u001b[39m\u001b[92m+ StaticArrays v1.2.13\u001b[39m\n", " \u001b[90m [82ae8749] \u001b[39m\u001b[92m+ StatsAPI v1.1.0\u001b[39m\n", " \u001b[90m [2913bbd2] \u001b[39m\u001b[92m+ StatsBase v0.33.13\u001b[39m\n", " \u001b[90m [4c63d2b9] \u001b[39m\u001b[92m+ StatsFuns v0.9.14\u001b[39m\n", " \u001b[90m [2da68c74] \u001b[39m\u001b[92m+ SymDict v0.3.0\u001b[39m\n", " \u001b[90m [5e66a065] \u001b[39m\u001b[92m+ TableShowUtils v0.2.5\u001b[39m\n", " \u001b[90m [3783bdb8] \u001b[39m\u001b[92m+ TableTraits v1.0.1\u001b[39m\n", " \u001b[90m [382cd787] \u001b[39m\u001b[92m+ TableTraitsUtils v1.0.2\u001b[39m\n", " \u001b[90m [bd369af6] \u001b[39m\u001b[92m+ Tables v1.6.1\u001b[39m\n", " \u001b[90m [3bb67fe8] \u001b[39m\u001b[92m+ TranscodingStreams v0.9.6\u001b[39m\n", " \u001b[90m [5c2747f8] \u001b[39m\u001b[92m+ URIs v1.3.0\u001b[39m\n", " \u001b[90m [efce3f68] \u001b[39m\u001b[92m+ 
WoodburyMatrices v0.5.5\u001b[39m\n", " \u001b[90m [228000da] \u001b[39m\u001b[92m+ XMLDict v0.4.1\u001b[39m\n", " \u001b[90m [f5851436] \u001b[39m\u001b[92m+ FFTW_jll v3.3.10+0\u001b[39m\n", " \u001b[90m [1d5cc7b8] \u001b[39m\u001b[92m+ IntelOpenMP_jll v2018.0.3+2\u001b[39m\n", " \u001b[90m [94ce4f54] \u001b[39m\u001b[92m+ Libiconv_jll v1.16.1+1\u001b[39m\n", " \u001b[90m [856f044c] \u001b[39m\u001b[92m+ MKL_jll v2021.1.1+2\u001b[39m\n", " \u001b[90m [efe28fd5] \u001b[39m\u001b[92m+ OpenSpecFun_jll v0.5.5+0\u001b[39m\n", " \u001b[90m [f50d1b31] \u001b[39m\u001b[92m+ Rmath_jll v0.3.0+0\u001b[39m\n", " \u001b[90m [02c8fc9c] \u001b[39m\u001b[92m+ XML2_jll v2.9.12+0\u001b[39m\n", " \u001b[90m [8bb1440f] \u001b[39m\u001b[92m+ DelimitedFiles\u001b[39m\n", " \u001b[90m [8ba89e20] \u001b[39m\u001b[92m+ Distributed\u001b[39m\n", " \u001b[90m [9fa8497b] \u001b[39m\u001b[92m+ Future\u001b[39m\n", " \u001b[90m [4af54fe1] \u001b[39m\u001b[92m+ LazyArtifacts\u001b[39m\n", " \u001b[90m [37e2e46d] \u001b[39m\u001b[92m+ LinearAlgebra\u001b[39m\n", " \u001b[90m [9abbd945] \u001b[39m\u001b[92m+ Profile\u001b[39m\n", " \u001b[90m [1a1011a3] \u001b[39m\u001b[92m+ SharedArrays\u001b[39m\n", " \u001b[90m [2f01184e] \u001b[39m\u001b[92m+ SparseArrays\u001b[39m\n", " \u001b[90m [10745b16] \u001b[39m\u001b[92m+ Statistics\u001b[39m\n", " \u001b[90m [4607b0f0] \u001b[39m\u001b[92m+ SuiteSparse\u001b[39m\n", " \u001b[90m [e66e0078] \u001b[39m\u001b[92m+ CompilerSupportLibraries_jll\u001b[39m\n", " \u001b[90m [4536629a] \u001b[39m\u001b[92m+ OpenBLAS_jll\u001b[39m\n", " \u001b[90m [05823500] \u001b[39m\u001b[92m+ OpenLibm_jll\u001b[39m\n", " \u001b[90m [8e850b90] \u001b[39m\u001b[92m+ libblastrampoline_jll\u001b[39m\n", "\u001b[32m\u001b[1mPrecompiling\u001b[22m\u001b[39m project...\n", "\u001b[32m ✓ \u001b[39m\u001b[90mRequires\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mOrderedCollections\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mCompat\u001b[39m\n", "\u001b[32m ✓ 
\u001b[39m\u001b[90mDataValueInterfaces\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mFillArrays\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mSymDict\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mAbstractFFTs\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mOpenLibm_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mInvertedIndices\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mMeasures\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mFixedPointNumbers\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mPDMats\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mReexport\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mInverseFunctions\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mIniFile\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mIndirectArrays\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mExprTools\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDocStringExtensions\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mHexagons\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mIrrationalConstants\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mIteratorInterfaceExtensions\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mAdapt\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mURIs\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mRetry\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDataAPI\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mCompilerSupportLibraries_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mStatsAPI\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mWoodburyMatrices\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mCrayons\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mIterTools\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mFormatting\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mGrisu\u001b[39m\n", "\u001b[32m ✓ \u001b[39mTranscodingStreams\n", "\u001b[32m ✓ \u001b[39m\u001b[90mRmath_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mLibiconv_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mMacroTools\u001b[39m\n", "\u001b[32m ✓ 
\u001b[39m\u001b[90mIntelOpenMP_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mSodiumSeal\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mFFTW_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mRatios\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDataValues\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mChainRulesCore\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mFilePathsBase\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mMocking\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDensityInterface\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mTableTraits\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mStaticArrays\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDataStructures\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mPooledArrays\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mOffsetArrays\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mOpenBLAS_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mMissings\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mAxisAlgorithms\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDistances\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mOpenSpecFun_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mShowoff\u001b[39m\n", "\u001b[32m ✓ \u001b[39mCodecZlib\n", "\u001b[32m ✓ \u001b[39m\u001b[90mHTTP\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mRmath\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mXML2_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mMedia\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mTableShowUtils\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mMKL_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mChangesOfVariables\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mSortingAlgorithms\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mlibblastrampoline_jll\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mTables\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mTableTraitsUtils\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mContour\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mQuadGK\u001b[39m\n", 
"\u001b[32m ✓ \u001b[39m\u001b[90mLoess\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mCategoricalArrays\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mEzXML\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mJuno\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mGitHub\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mQueryOperators\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mLogExpFunctions\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mInterpolations\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mIterableTables\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mXMLDict\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mColorTypes\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mStatsBase\u001b[39m\n", "\u001b[32m ✓ \u001b[39mQuery\n", "\u001b[32m ✓ \u001b[39m\u001b[90mPrettyTables\u001b[39m\n", "\u001b[32m ✓ \u001b[39mAWS\n", "\u001b[32m ✓ \u001b[39m\u001b[90mSpecialFunctions\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mCoupledFields\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mFFTW\u001b[39m\n", "\u001b[32m ✓ \u001b[39mAWSS3\n", "\u001b[32m ✓ \u001b[39m\u001b[90mStatsFuns\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mColors\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mDistributions\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mKernelDensity\u001b[39m\n", "\u001b[32m ✓ \u001b[39m\u001b[90mCompose\u001b[39m\n", "\u001b[32m ✓ \u001b[39mGadfly\n", "\u001b[32m ✓ \u001b[39mDataFrames\n", " 96 dependencies successfully precompiled in 58 seconds (15 already precompiled)\n" ] } ], "source": [ "using Pkg\n", "Pkg.add([\"Gadfly\", \"AWS\", \"AWSS3\", \"DataFrames\", \"Query\", \"JSON\", \"TranscodingStreams\", \"CodecZlib\"])" ] }, { "cell_type": "markdown", "id": "03831149-6b10-457b-bf8e-34fad25944b6", "metadata": {}, "source": [ "## Example plots using Gadfly\n", "\n", "First up, we'll plot a histogram of some normally-distributed random numbers. Then, we'll plot an analogous 2 dimensional density plot of some normally-distributed random numbers." 
] }, { "cell_type": "code", "execution_count": 2, "id": "458a5b21-2859-46da-a70e-7a649e581a7f", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ], "text/html": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "Plot(...)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using Gadfly\n", "\n", "# Julia compiles Gadfly's plotting code just in time, so the first plot can\n", "# take a little while to render. Subsequent plots are fast.\n", "plot(x = randn(10000),\n", " Geom.histogram)" ] }, { "cell_type": "code", "execution_count": 3, "id": "d941ab84-b2ab-4256-8018-32259fa37e17", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n" ], "text/html": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "Plot(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot(x = randn(1000),\n", " y = randn(1000),\n", " Geom.density2d)" ] }, { "cell_type": "markdown", "id": "b0ebe5dc-5d71-4e09-93ae-a29d718da4f9", "metadata": {}, "source": [ "## Accessing and using data from Amazon S3\n", "\n", "Here we will use the third-party AWS SDK and high-level S3 interface. The data we will use [are listed at the Registry of Open Data on AWS (RODA)](https://registry.opendata.aws/lei/), and are stored in the public S3 bucket `s3://gleif/` in the `eu-central-1` region. This dataset uniquely identifies legal entities using a Legal Entity Identifier (LEI) code, and also provides information about ownership structures.\n", "\n", "We will access these data anonymously, which can be done by creating and using an `aws_config` structure with `creds=nothing`. Note that we must also supply the bucket's region with our configuration.\n", "\n", "Our first operation will be to list all the keys in this S3 bucket that begin with \"data/json/lei\". We can see that the data are organized by date and time and have a \".json.gz\" extension, denoting that these data are JSON-formatted and compressed using gzip."
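, "\n", "\n",
"Because the date and time are encoded in each key, a key can be parsed to select objects by date. The following is a minimal sketch rather than part of the original analysis: the names `key`, `m`, and `keydate` are illustrative, and the key is the relationship-record key that is read later in this notebook, which follows the same `date=.../time=...` layout.\n",
"\n",
"```julia\n",
"using Dates\n",
"\n",
"# Key copied from the relationship-record object loaded later in this notebook.\n",
"key = \"data/json/rr/date=2021-12-10/time=00:00/20211210-0000-gleif-goldencopy-rr.json.gz\"\n",
"\n",
"# Extract the date=YYYY-MM-DD partition and parse it into a Date.\n",
"m = match(r\"date=(\\d{4}-\\d{2}-\\d{2})\", key)\n",
"keydate = Date(String(m[1]))\n",
"\n",
"# Confirm the gzip-compressed JSON extension.\n",
"println(keydate, \" \", endswith(key, \".json.gz\"))  # prints 2021-12-10 true\n",
"```"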
] }, { "cell_type": "code", "execution_count": 4, "id": "75dbc83b-7241-4bc6-a6f0-18cb98d5097f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4231-element Vector{String}:\n", " \"data/json/lei/date=2018-02-09/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-09/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-09/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-10/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-10/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-10/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-11/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-11/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-11/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-12/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-12/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-12/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2018-02-13/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " ⋮\n", " \"data/json/lei/date=2021-12-16/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-17/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-17/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-17/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-18/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-18/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-18/t\" ⋯ 21 bytes ⋯ 
\"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-19/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-19/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-19/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-20/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"\n", " \"data/json/lei/date=2021-12-20/t\" ⋯ 21 bytes ⋯ \"00-gleif-goldencopy-lei.json.gz\"" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using AWS\n", "using AWSS3 # high-level interface to S3\n", "\n", "# This AWSConfig instance allows us to make anonymous / unsigned requests\n", "aws = global_aws_config(; creds = nothing, region = \"eu-central-1\")\n", "\n", "# s3_list_keys returns a generator; collect materializes it into a Vector\n", "obj = collect(s3_list_keys(aws, \"gleif\", \"data/json/lei/\"; delimiter = \"\"))" ] }, { "cell_type": "markdown", "id": "a4a2d5c3-08ba-48fc-b0df-e1dee7aeda16", "metadata": {}, "source": [ "Next, we'll load one of the compressed JSON objects from S3. Because it is gzipped, we first need to decompress it, and because it is in JSON format, we then need to parse it to work with it in Julia.\n", "\n", "Objects in S3 can be very large ([up to 5 TB in a single object as of this writing](https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html)). Processing larger objects may require an incremental strategy; in this case, however, the objects we're interested in fit safely in memory. We therefore download, decompress, and parse the whole object at once."
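, "\n", "\n",
"For objects too large to hold in memory, one incremental option is to decompress as a stream instead of all at once. The following is a minimal sketch under stated assumptions, not part of the original analysis: it assumes line-oriented content and uses a small in-memory buffer (`buf`, a hypothetical stand-in for a large S3 object) so that it runs without network access. The GLEIF objects are single JSON documents, so a streaming JSON parser would also be needed; this sketch covers only the decompression half.\n",
"\n",
"```julia\n",
"using CodecZlib\n",
"\n",
"# Hypothetical stand-in for a large gzipped object: three short lines compressed in memory.\n",
"buf = IOBuffer()\n",
"for word in (\"first\", \"second\", \"third\")\n",
"    println(buf, word)\n",
"end\n",
"compressed = transcode(GzipCompressor, take!(buf))\n",
"\n",
"# Decompress incrementally: the stream yields one line at a time.\n",
"stream = GzipDecompressorStream(IOBuffer(compressed))\n",
"nlines = count(_ -> true, eachline(stream))\n",
"close(stream)\n",
"println(nlines)  # prints 3\n",
"```"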
] }, { "cell_type": "code", "execution_count": 5, "id": "f9cca693-52c3-4f40-816b-08f6954876e2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dict{String, Any} with 1 entry:\n", " \"relations\" => Any[Dict{String, Any}(\"RelationshipRecord\"=>Dict{String, Any}(…" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using JSON, TranscodingStreams, CodecZlib\n", "\n", "rr = JSON.parse(String(transcode(GzipDecompressor, read(S3Path(\"s3://gleif/data/json/rr/date=2021-12-10/time=00:00/20211210-0000-gleif-goldencopy-rr.json.gz\")))))" ] }, { "cell_type": "markdown", "id": "143be21e-32b1-4629-91f9-66992e079d98", "metadata": {}, "source": [ "The resulting dictionary object contains a \"relations\" key that contains a number of records pertaining to the relationships between legal entities. To help us understand the structure of an individual record, we can print the first record as JSON." ] }, { "cell_type": "code", "execution_count": 6, "id": "87b1ba1e-b272-49d1-ace6-d6a977498fc0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"RelationshipRecord\": {\n", " \"Registration\": {\n", " \"InitialRegistrationDate\": {\n", " \"$\": \"2012-10-01T11:09:00.000Z\"\n", " },\n", " \"RegistrationStatus\": {\n", " \"$\": \"PUBLISHED\"\n", " },\n", " \"NextRenewalDate\": {\n", " \"$\": \"2022-06-03T13:04:00.000Z\"\n", " },\n", " \"LastUpdateDate\": {\n", " \"$\": \"2021-06-09T21:32:00.000Z\"\n", " },\n", " \"ValidationSources\": {\n", " \"$\": \"ENTITY_SUPPLIED_ONLY\"\n", " },\n", " \"ValidationDocuments\": {\n", " \"$\": \"SUPPORTING_DOCUMENTS\"\n", " },\n", " \"ManagingLOU\": {\n", " \"$\": \"EVK05KS7XY1DEII3R011\"\n", " }\n", " },\n", " \"Relationship\": {\n", " \"RelationshipStatus\": {\n", " \"$\": \"ACTIVE\"\n", " },\n", " \"EndNode\": {\n", " \"NodeIDType\": {\n", " \"$\": \"LEI\"\n", " },\n", " \"NodeID\": {\n", " \"$\": \"ZZG38T0MDR3QY1ETUA76\"\n", " }\n", " },\n", " 
\"StartNode\": {\n", " \"NodeIDType\": {\n", " \"$\": \"LEI\"\n", " },\n", " \"NodeID\": {\n", " \"$\": \"010CMKZ3VON21WF2ZD45\"\n", " }\n", " },\n", " \"RelationshipPeriods\": {\n", " \"RelationshipPeriod\": [\n", " {\n", " \"EndDate\": {\n", " \"$\": \"2018-12-31T00:00:00.000Z\"\n", " },\n", " \"StartDate\": {\n", " \"$\": \"2018-01-01T00:00:00.000Z\"\n", " },\n", " \"PeriodType\": {\n", " \"$\": \"ACCOUNTING_PERIOD\"\n", " }\n", " },\n", " {\n", " \"StartDate\": {\n", " \"$\": \"2018-02-06T00:00:00.000Z\"\n", " },\n", " \"PeriodType\": {\n", " \"$\": \"RELATIONSHIP_PERIOD\"\n", " }\n", " }\n", " ]\n", " },\n", " \"RelationshipType\": {\n", " \"$\": \"IS_DIRECTLY_CONSOLIDATED_BY\"\n", " },\n", " \"RelationshipQualifiers\": {\n", " \"RelationshipQualifier\": {\n", " \"QualifierDimension\": {\n", " \"$\": \"ACCOUNTING_STANDARD\"\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n" ] } ], "source": [ "print(json(rr[\"relations\"][1], 4)) # the second argument (4) tells the JSON library to indent the output by 4 spaces" ] }, { "cell_type": "markdown", "id": "644deb47-de5d-4d28-bc58-b1b44682bfff", "metadata": {}, "source": [ "To help us do some simple analysis of these relationship records, we'll construct a DataFrame containing just a handful of fields that we're interested in. Here we construct the DataFrame from named column vectors extracted from the dictionary we parsed from the JSON object in S3." ] }, { "cell_type": "code", "execution_count": 7, "id": "92d90a1c-c2c1-4770-a5fe-3abe9538dd52", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
173,523 rows × 3 columns
| | startnode | endnode | relationshiptype |
---|---|---|---|
| | String | String | String |
1 | 010CMKZ3VON21WF2ZD45 | ZZG38T0MDR3QY1ETUA76 | IS_DIRECTLY_CONSOLIDATED_BY |
2 | 010CMKZ3VON21WF2ZD45 | 3C7474T6CDKPR9K6YT90 | IS_ULTIMATELY_CONSOLIDATED_BY |
3 | 010PWNH4K3BLIC3I7R03 | 549300COKYB5EGSU1838 | IS_DIRECTLY_CONSOLIDATED_BY |
4 | 010PWNH4K3BLIC3I7R03 | 549300B2Q47IR0CR5B54 | IS_ULTIMATELY_CONSOLIDATED_BY |
5 | 01J4SO3XTWZF4PP38209 | 5493000HPQ4D2RZ79739 | IS_DIRECTLY_CONSOLIDATED_BY |
6 | 01J4SO3XTWZF4PP38209 | 5493000HPQ4D2RZ79739 | IS_ULTIMATELY_CONSOLIDATED_BY |
7 | 01TRDHWDCL69YP41S025 | NZKGWYDO5O20EZTLPF09 | IS_DIRECTLY_CONSOLIDATED_BY |
8 | 01TRDHWDCL69YP41S025 | LORM1GNEU1DKEW527V90 | IS_ULTIMATELY_CONSOLIDATED_BY |
9 | 020BQJXAXCZNLKIN7326 | 549300PFEWKNHRG25N08 | IS_ULTIMATELY_CONSOLIDATED_BY |
10 | 0292001381F1R1IB5B85 | 029200388A7S7Z5I0H57 | IS_DIRECTLY_CONSOLIDATED_BY |
11 | 0292001381F1R1IB5B85 | 029200388A7S7Z5I0H57 | IS_ULTIMATELY_CONSOLIDATED_BY |
12 | 0292001568C3M3WGI292 | 0292004558B0R5B7I133 | IS_DIRECTLY_CONSOLIDATED_BY |
13 | 0292001568C3M3WGI292 | 0292004558B0R5B7I133 | IS_ULTIMATELY_CONSOLIDATED_BY |
14 | 0292002717G4T0CH6E65 | 029200370B2O2ZA1C754 | IS_DIRECTLY_CONSOLIDATED_BY |
15 | 0292002717G4T0CH6E65 | 029200370B2O2ZA1C754 | IS_ULTIMATELY_CONSOLIDATED_BY |
16 | 029200302D4K9BC3E535 | 029200343C4N7D2E6H83 | IS_DIRECTLY_CONSOLIDATED_BY |
17 | 029200302D4K9BC3E535 | 6SHGI4ZSSLCXXQSBB395 | IS_ULTIMATELY_CONSOLIDATED_BY |
18 | 0292003053E9T6XD4J69 | 029200268F8M5YI5I629 | IS_DIRECTLY_CONSOLIDATED_BY |
19 | 0292003053E9T6XD4J69 | 029200268F8M5YI5I629 | IS_ULTIMATELY_CONSOLIDATED_BY |
20 | 029200325G5L3B56F539 | 254900O2M53JMNF7U320 | IS_DIRECTLY_CONSOLIDATED_BY |
21 | 029200325G5L3B56F539 | 029200370B2O2ZA1C754 | IS_ULTIMATELY_CONSOLIDATED_BY |
22 | 0292003272D0K9BC3A91 | 0292002595F3Q7UF5D85 | IS_DIRECTLY_CONSOLIDATED_BY |
23 | 0292003272D0K9BC3A91 | 0292002595F3Q7UF5D85 | IS_ULTIMATELY_CONSOLIDATED_BY |
24 | 029200343C4N7D2E6H83 | CYSD4L2598RQ9BPPCB72 | IS_DIRECTLY_CONSOLIDATED_BY |
25 | 029200343C4N7D2E6H83 | 6SHGI4ZSSLCXXQSBB395 | IS_ULTIMATELY_CONSOLIDATED_BY |
26 | 0292003540H0S4VA7A50 | 0292002717G4T0CH6E65 | IS_DIRECTLY_CONSOLIDATED_BY |
27 | 0292003540H0S4VA7A50 | 0292002717G4T0CH6E65 | IS_ULTIMATELY_CONSOLIDATED_BY |
28 | 029200355B5N3V4F1F58 | 029200239B8P7V6L1I90 | IS_DIRECTLY_CONSOLIDATED_BY |
29 | 029200355B5N3V4F1F58 | 029200239B8P7V6L1I90 | IS_ULTIMATELY_CONSOLIDATED_BY |
30 | 0292003601E9LB4J5157 | 254900O2M53JMNF7U320 | IS_DIRECTLY_CONSOLIDATED_BY |
⋮ | ⋮ | ⋮ | ⋮ |
34,243 rows × 2 columns
| | endnode | count |
---|---|---|
| | String | Int64 |
1 | 784F5XWPLTWKTBV3E584 | 405 |
2 | 7LTWFZYICNSX8D621K86 | 365 |
3 | K8MS7FD7N5Z2WQ51AZ71 | 362 |
4 | W38RGI023J3WT1HWRP32 | 308 |
5 | 213800JH9QQWHLO99821 | 274 |
6 | TKPE0FGSGCIV1TZ04B42 | 256 |
7 | O4QK7KMMK83ITNTHUG69 | 255 |
8 | O2RNE8IBXP4R0TD8PU41 | 253 |
9 | 5493000GV3I1S3MMF497 | 246 |
10 | 549300TRUWO2CD2G5692 | 205 |
11 | H7FNTJ4851HG0EXQ1Z70 | 205 |
12 | 529900MUF4C20K50JS49 | 202 |
13 | 96950066U5XAAIRCPA78 | 187 |
14 | 52990063S23N13HU4E98 | 169 |
15 | HKUPACFHSSASQK8HLS17 | 164 |
16 | 335800AMYJCXE3KCAC26 | 133 |
17 | OQ3T05P7YR8P5YJEVI93 | 130 |
18 | NNVPP80YIZGEY2314M97 | 123 |
19 | E57ODZWZ7FF32TWEFA76 | 122 |
20 | R0MUWSFPU8MPRO8K5P83 | 112 |
21 | 529900K9B0N5BT694847 | 110 |
22 | 335800X9S32QAPOTYJ95 | 107 |
23 | I07WOS4YJ0N7YRFE7309 | 106 |
24 | F5WCUMTUM4RKZ1MAIE39 | 105 |
25 | 549300J4U55H3WP1XT59 | 103 |
26 | Q9MAIUP40P25UFBFG033 | 102 |
27 | 529900NNUPAGGOMPXZ31 | 101 |
28 | 335800BITJZHVZ1CN475 | 98 |
29 | 549300PPXHEU2JF0AM85 | 97 |
30 | 335800GRUPRB6ZIXOV36 | 97 |
⋮ | ⋮ | ⋮ |