= Load data
:toc:
:icons:
:linkattrs:
:imagesdir: ../../resources/images


== Summary

This section will load data into Amazon FSx for Lustre from the integrated Amazon S3 bucket (data repository).

The workshop environment created an FSx for Lustre file system with a data repository integrated with the "nasanex" bucket in the US West (Oregon) region. link:https://registry.opendata.aws/nasanex/[NASA NEX] is a part of the link:https://registry.opendata.aws/[Registry of Open Data on AWS] project and is a collection of Earth science datasets maintained by NASA, including climate change projections and satellite images of the Earth's surface.


== Duration

NOTE: It will take approximately 15 minutes to complete this section.


== Step-by-step Guide

IMPORTANT: Read through all steps below before continuing.

=== Lazy load data

Complete these steps from the SSH terminal session window connected to *Linux Instance 0*.

*_Copy_* the scripts below and *_paste_* it into your favorite text editor.
*_Copy_* the full path of one of the .tif files and *_replace_* the *_FILE_* placeholder.
*_Execute_* the script

. Generate a list of some .tif files
+
[source,bash]
----
time lfs find /fsx --type f --name *.tif | head -100

----
+
. *_Copy_* the full path of one of the .tif files and *_replace_* the *_FILE_* placeholder.
*_Execute_* the script.
+
. How large is the file? Run an ls -la against the file to find out the size.
+
[source,bash]
----
time ls -lah FILE # e.g. time ls -lah /fsx/NAIP/ca_1m_2012/36117/m_3611718_nw_11_1_20120623.tif

----
+
My results are below. The file I'm using is 183 MB.
+
----
-rwxr-xr-x. 1 root root 183M Aug 15  2014 /fsx/NAIP/ca_1m_2012/36117/m_3611718_nw_11_1_20120623.tif
----
+
. Read the file and measure the time it takes to load it from the integrated S3 bucket. This command will load the file by reading it from S3 and writing it to tempfs on *Linux Instance 0*.
+
[source,bash]
----
time cat FILE >/dev/shm/fsx # e.g. time cat /fsx/NAIP/ca_1m_2012/36117/m_3611718_nw_11_1_20120623.tif >/dev/shm/fsx

----
+
. How long did it take? My results are below.
+
----
real    0m4.702s
user    0m0.000s
sys     0m0.152s
----
+
. *_Re-run_* the same **"time cat..."** command and see how long it takes? My results are below.
+
[source,bash]
----
real    0m0.110s
user    0m0.000s
sys     0m0.108s
----
+
[qanda]
. Where did the last command get it's data?
The cache on *Linux Instance 0*. You can verify this by opening a second SSH terminal session window to this same instance and run `nload -u M` to see real-time network throughput of the instance and re-run the command. To force the read operation to go back to FSx for Lustre, drop the cache on *Linux Instance 0* and re-run the command.
+
[source,bash]
----
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches'
time cat FILE >/dev/shm/fsx # e.g. time cat /fsx/NAIP/ca_1m_2012/36117/m_3611718_nw_11_1_20120623.tif >/dev/shm/fsx

----
+
. How long did it take? My results are below.
+
[source,bash]
----
real	0m0.389s
user	0m0.000s
sys     0m0.170s
----


=== Bulk load data

Complete these steps from two SSH terminal session windows connected to *Linux Instance 0*.

. Open two (2) SSH terminal session windows connected to *Linux Instance 0*.
+
Start `*nload*` in one of the SSH terminal session windows.
+
[source,bash]
----
nload -u M

----
+
*_Copy_* the scripts below and *_execute_* them in the other SSH terminal session window.
+
. Generate a list of files in the /CMIP5 directory.
+
[source,bash]
----
time tree --du -h /fsx/CMIP5

----
+
. *_Execute_* the script below to bulk load all data for files in the /CMIP5 directory from the integrated s3://nasanex bucket to the FSx for Lustre file system.
+
----
threads=36
time lfs find /fsx/CMIP5 --type f | parallel --will-cite -j ${threads} sudo lfs hsm_restore {}

----
. *_Monitor_* the network throughput of the instance from the other SSH terminal window running `*nload*`. These objects are not copied through the instance from S3 to the file system. The GET API is called in parallel to copy data directly from S3 to the OSTs.


. Open the link:https://console.aws.amazon.com/cloudwatch/[Amazon CloudWatch] console.
+
TIP: *_Context-click (right-click)_* the link above and open the link in a new tab or window to make it easy to navigate between this github workshop and AWS console.
+
. *_Select_* *Dashboards* from the left navigation pane.
. *_Select_* the link of the dashboard created as a part of the workshop environment.
+
TIP: The name of the dashboard should be the region name (e.g. us-east-1), dash (-), file system id of *Amazon FSx for Lustre* (e.g. us-east-1-fs-0123456789abcdef).
. *_Monitor_* the throughput widget to see how much throughput this bulk data load operation is achieving. The /CMIP5 directory has 775 directories, 416 files, and uses 68 GB of storage.


== Next section

Click the button below to go to the next section.

image::04-test-performance.png[link=../04-test-performance/, align="left",width=420]