{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to AWS ParallelCluster\n", "\n", "For an overview of this workshop, please read [README.md](README.md). We recommend you use Amazon SageMaker Jupyter Notebook instance for this workshop. \n", "\n", "This notebook shows the main steps to create a ParallelCluster. Steps to prepare (pre and post cluster creation) for the ParallelCluster are coded in pcluster-athena.py. \n", "\n", "#### Before: \n", "- Create ssh key\n", "- Create or use existing default VPC\n", "- Create an S3 bucket for config, post install, sbatch, Athena++ initial condition files\n", "- Create or use exsiting MySQL RDS database for Slurm accounting\n", "- Update VPC security group to allow traffic to MySQL RDS\n", "- Create a secret for RDS in AWS Secret Manager\n", "- Upload post install script to S3\n", "- Fill the ParallelCluster config template with values of region, VPC_ID, SUBNET_ID, KEY_NAME, POST_INSTALLSCRIPT_LOCATION and POST_INSALL_SCRIPT_ARGS \n", "- Upload the config file to S3\n", "\n", "#### After\n", "- Update VPC security group attached to the headnode to open port 8082 (SLURM REST port) to this notebook\n", "\n", "#### Note: \n", "The reason to break out the process into before and after is to capture the output of the cluster creation from \"pcluster create\" command and display the output in the notebook.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Before you start - install nodejs in the current kernal\n", "\n", "**Note** If you are in an AWS led workshop using Workshop Studio account, you do not need to do the following. Node.js has been installed for you.\n", "\n", "pcluster3 requires nodejs executables. We wil linstall that in the current kernal. \n", "\n", "SageMaker Jupyter notebook comes with multiple kernals. We use \"conda_python3\" in this workshop. If you need to switch to another kernal, please change the kernal in the following instructions accordingly. \n", "\n", "1. Open a terminal window from File/New/Ternimal - this will open a terminal with \"sh\" shell.\n", "2. exetute ```bash``` command to switch to \"bash\" shell\n", "3. execute ```conda activate python3```\n", "4. execute the following commands (you can cut and paste the following lines and paste into the terminal)\n", "\n", "```\n", "curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash\n", "chmod +x ~/.nvm/nvm.sh\n", "~/.nvm/nvm.sh\n", "bash\n", "nvm install v16.3.0\n", "node --version\n", "```\n", "\n", "**After you installed nodejs in the current kernel, restart the kernal by reselecting the \"conda_python3\" on the top right corner of the notebook.** You should see the output of the version of node, such as \"v16.9.1\" after running the following cell." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!node --version" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# install the pcluster CLI\n", "#!pip3 install --upgrade pip\n", "!pip install urllib3==1.26.6\n", "\n", "pcluster_version=\"3.6.0\"\n", "!sudo /usr/local/bin/pip3 install --upgrade aws-parallelcluster==$pcluster_version\n", "\n", "!pcluster version" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import json\n", "import time\n", "import os\n", "import sys\n", "import base64\n", "import docker\n", "import pandas as pd\n", "import importlib\n", "import project_path # path to helper methods\n", "from lib import workshop\n", "from botocore.exceptions import ClientError\n", "from IPython.display import HTML, display\n", "\n", "#sys.path.insert(0, '.')\n", "import pcluster_athena\n", "importlib.reload(pcluster_athena)\n", "\n", "\n", "REGION=\"us-east-1\"\n", "\n", "# default post install script include steps to compile slurm, hdf5, athena++, which could take over 25 minutes. \n", "# For a short workshop, we will pull compiled slurm, hdf5 and athena++ from an URL and shorten the cluster creation time to around 10 minutes. \n", "# default script\n", "pcluster_name = 'myPC5c'\n", "config_name = \"config3.yaml\"\n", "post_install_script_prefix = \"scripts/pcluster_post_install_fast.sh\"\n", "\n", "# uncomment this to use c7g\n", "#pcluster_name = 'myPC7g'\n", "#config_name = 'config3-c7g.yaml'\n", "#post_install_script_prefix = 'scripts/pcluster_post_install_7g.sh'\n", "\n", "\n", "\n", "# create the build folder if doesn't exist already\n", "!mkdir -p build\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the PClusterHelper module" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this is used during developemnt, to reload the module after a change in the module\n", "try:\n", " del sys.modules['pcluster_athena']\n", "except:\n", " #ignore if the module is not loaded\n", " print('Module not loaded, ignore')\n", " \n", "from pcluster_athena import PClusterHelper\n", "\n", "importlib.reload(workshop)\n", " \n", "from pcluster_athena import PClusterHelper\n", "\n", "# create the cluster - # You can rerun the rest of the notebook again with no harm. There are checks in place for existing resoources. \n", "pcluster_helper = PClusterHelper(pcluster_name, config_name, post_install_script_prefix)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the parallel cluster\n", "\n", "If you are using the default post install script, the process will take up to 30 minutes. The two steps that take longer than the rest are the creation of a MySQL RDS instance, and the installation of Slurm, Athena++, HDF5 libs when running the post installation script. If you are using the faster version of the script (scripts/pcluster_post_install_fast.sh), the post installation script will pull in compiled Slurm, Athena++ and HDF5, so the process will only take around 15 minutes. \n", "\n", "**Note**: If you want to see how each step is done, please use the \"pcluster-athena++\" notebook in the same directory.\n", "\n", "**Note**: If you are running a workshop using this notebook, you can create the MySQL RDS instance beforehand, using the CloudFormation template in this directory \"db-create.yml\". 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the pre-reqs for the cluster: IAM roles, VPC, etc.\n", "pcluster_helper.create_before()\n", "\n", "# Create the cluster\n", "!pcluster create-cluster --cluster-name $pcluster_helper.pcluster_name --rollback-on-failure False --cluster-configuration build/$config_name --region $pcluster_helper.region\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In ParallelCluster v3, the cluster creation command runs asynchronously. Use a waiter to monitor the creation process. This will take around 20 minutes. While you are waiting, please check out the config file under the \"build\" folder and the post_install scripts used for cluster creation. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "cf_client = boto3.client('cloudformation')\n", "\n", "waiter = cf_client.get_waiter('stack_create_complete')\n", "\n", "try:\n", "    print(\"Waiting for cluster creation to complete ... \")\n", "    waiter.wait(StackName=pcluster_name)\n", "except botocore.exceptions.WaiterError as e:\n", "    print(e)\n", "\n", "print(\"Cluster creation completed. \")\n", "pcluster_helper.create_after()\n", "\n", "resp = cf_client.describe_stacks(StackName=pcluster_name)\n", "\n", "print(resp)\n", "\n", "outputs = resp[\"Stacks\"][0][\"Outputs\"]\n", "\n", "slurm_host = ''\n", "for o in outputs:\n", "    if o['OutputKey'] == 'HeadNodePrivateIP':\n", "        slurm_host = o['OutputValue']\n", "        print(\"Slurm REST endpoint is on \", slurm_host)\n", "        break\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## SSH onto the headnode\n", "\n", "The previous step also created an SSH key with the name \"pcluster-athena-key.pem\" in the notebooks/parallelcluster folder. We can use that key to SSH onto the headnode of the cluster. \n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cp -f pcluster-athena-key.pem ~/.ssh/pcluster-athena-key.pem\n", "!chmod 400 ~/.ssh/pcluster-athena-key.pem" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Open a terminal from \"File/New/Terminal\" and execute the following commands: \n", "\n", "```\n", "pcluster_name=$(pcluster list-clusters --region us-east-1 | jq \".clusters[0].clusterName\" | tr -d '\"')\n", "\n", "pcluster ssh --cluster-name $pcluster_name -i ~/.ssh/pcluster-athena-key.pem --region us-east-1\n", "```\n", "\n", "After you log in to the head node, you can try Slurm commands such as: \n", "- sinfo\n", "- squeue\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Integrate with the Slurm REST API running on the head node\n", "\n", "\n", "![Slurmrestd_diagram](images/parallelcluster_restd_diagram.png \"Slurm REST API on AWS ParallelCluster\")\n", "\n", "The Slurm REST API is running on the head node. The JWT token is stored in AWS Secrets Manager from the head node. You will need that JWT token in the header of all your REST API requests. \n", "\n", "Don't forget to add the secretsmanager:GetSecretValue permission to the SageMaker execution role that runs this notebook.\n", "\n", "**JWT token**\n", "To pass it securely to this notebook, we first create a cron job on the head node to retrieve the token, then save it in Secrets Manager with the name \"slurm_token_{cluster_name}\". The default JWT token lifespan is 1800 seconds (30 mins).\n",
"Run the following script on the head node as a cron job to update the token every 20 minutes.\n", "\n", "The following steps are included in the post_install script. You DO NOT need to run them.\n", "\n", "**Step 1. Add permission to the instance role for the head node**\n", "We use additional_iam_role in the pcluster configuration file to attach a Secrets Manager read/write policy to the instance role on the cluster.\n", "\n", "**Step 2. Create a script \"token_refresher.sh\"**\n", "Assume we save the following script at /shared/token_refresher.sh:\n", "```\n", "#!/bin/bash\n", "\n", "REGION=us-east-1\n", "export $(/opt/slurm/bin/scontrol token -u slurm)\n", "\n", "aws secretsmanager describe-secret --secret-id slurm_token --region $REGION\n", "\n", "if [ $? -eq 0 ]\n", "then\n", "    aws secretsmanager update-secret --secret-id slurm_token --secret-string \"$SLURM_JWT\" --region $REGION\n", "else\n", "    aws secretsmanager create-secret --name slurm_token --secret-string \"$SLURM_JWT\" --region $REGION\n", "fi\n", "```\n", "\n", "**Step 3. Add a file \"slurm-token\" in /etc/cron.d/**\n", "```\n", "# Run the slurm token update every 20 minutes \n", "SHELL=/bin/bash\n", "PATH=/sbin:/bin:/usr/sbin:/usr/bin\n", "MAILTO=root\n", "*/20 * * * * root /shared/token_refresher.sh \n", "\n", "```\n", "\n", "**Step 4. Add permission to access Secrets Manager for this notebook**\n", "Don't forget to add the secretsmanager:GetSecretValue permission to the SageMaker execution role that runs this notebook.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect the Slurm REST API Schema\n", "\n", "We will start by examining the Slurm REST API schema.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import requests\n", "import json\n", "\n", "slurm_openapi_ep = 'http://'+slurm_host+':8082/openapi/v3'\n", "slurm_rest_base='http://'+slurm_host+':8082/slurm/v0.0.37'\n", "\n", "_, get_headers = pcluster_helper.update_header_token()\n", "\n", "print(get_headers)\n", "\n", "resp_api = requests.get(slurm_openapi_ep, headers=get_headers)\n", "print(resp_api)\n", "\n", "if resp_api.status_code != 200:\n", "    # This means something went wrong.\n", "    print(\"Error\", resp_api.status_code)\n", "\n", "with open('build/slurm_api.json', 'w') as outfile:\n", "    json.dump(resp_api.json(), outfile)\n", "\n", "print(json.dumps(resp_api.json(), indent=2))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Use REST API calls to interact with ParallelCluster\n", "\n", "Then we will make direct REST API requests to retrieve the partitions in the response.\n", "\n", "If you get server errors, you can log in to the head node and check the system logs of \"slurmrestd\", which is running as a service. \n"
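, "\n", "The helper methods used below (update_header_token and get_response_as_json) wrap the token handling described above. As a rough illustration of what they likely do, here is a minimal sketch; the secret name follows the earlier description and the user name is \"slurm\", but check pcluster_athena.py for the exact details:\n", "\n", "```python\n", "# Illustrative sketch: read the JWT from Secrets Manager and call the Slurm REST API directly.\n", "import boto3\n", "import requests\n", "\n", "# The post install script may store the secret as 'slurm_token' or 'slurm_token_<cluster_name>'.\n", "token = boto3.client('secretsmanager', region_name=REGION).get_secret_value(\n", "    SecretId='slurm_token_' + pcluster_name)['SecretString']\n", "\n", "headers = {'X-SLURM-USER-NAME': 'slurm',   # the token was generated for the slurm user\n", "           'X-SLURM-USER-TOKEN': token,\n", "           'Content-Type': 'application/json'}\n", "\n", "# /ping is a lightweight health-check endpoint in the v0.0.37 API.\n", "resp = requests.get('http://' + slurm_host + ':8082/slurm/v0.0.37/ping', headers=headers)\n", "print(resp.status_code, resp.json())\n", "```\n"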
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "partition_info = [\"name\", \"nodes\", \"total_cpus\", \"total_nodes\"]\n", "\n", "##### This works as well, \n", "# update header in case the token has expired\n", "_, get_headers = pcluster_helper.update_header_token()\n", "\n", "##### call REST API directly\n", "slurm_partitions_url= slurm_rest_base+'/partitions/'\n", "partitions = pcluster_helper.get_response_as_json(slurm_partitions_url)\n", "pcluster_helper.print_table_from_json_array(partition_info, partitions['partitions'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submit a job\n", "\n", "**A note about using a python client generated based on the REST API schema**: The slurm_rest_api_client job submit function doesn't include the \"script\" parameter. We will have to use the REST API Post directly.\n", "\n", "The body of the post should be like this. \n", "```\n", "{\"job\": {\"account\": \"test\", \"ntasks\": 20, \"name\": \"test18.1\", \"nodes\": [2, 4],\n", "\"current_working_directory\": \"/tmp/\", \"environment\": {\"PATH\": \"/bin:/usr/bin/:/usr/local/bin/\",\"LD_LIBRARY_PATH\":\n", "\"/lib/:/lib64/:/usr/local/lib\"} }, \"script\": \"#!/bin/bash\\necho it works\"}\n", "```\n", "When the job is submitted through REST API, it will run as the user \"slurm\". That's what the work directory \"/shared/tmp\" should be owned by \"slurm:slurm\", which is done in the post_install script. \n", "\n", "To run Athena++, we will need to provide a input file. Without working on the cluster directly, we will need to use a \"fetch and run\" method, where we can pass the S3 locations of the input file and the sbatch script file to the fetch_and_run.sh script. It will fetch the sbatch script and the input file from S3 and put them in /shared/tmp, then submit the sbatch script as a slurm job. \n", "\n", "As the result, you will see two jobs running: \n", "1. fetch_and_run script - sinfo will show that you have one compute node spawning and then running, once the script is completed, it will start the second job \n", "2. sbatch script - sinfo will show you " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Program batch script, input and output files\n", "\n", "To share the pcluster among different users and make sure users can only access their own input and output files, we will use user's ow S3 buckets for input and output files.\n", "\n", "The job will be running on the ParallelCluster under /efs/tmp (for example) through a fatch (from the S3 bucket) and run script and the output will be stored in the same bucket under \"output\" path. \n", "\n", "Let's prepare the input file and the sbatch file.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "################## Change these \n", "# Where the batch script, input file, output files are uploaded to S3\n", "\n", "cluster_instance_type = 'c5n.2xlarge'\n", "simulation_name = \"orz-256x256\"\n", "\n", "# for c5n.18xlarge without HyperThreading, the number of cores is 32 - change this accordingly. \n", "#num_cores_per_node = 32\n", "#partition = \"big\"\n", "# for c5n.2xlarge without HyperThreading, the number of cores is 4 - change this accordingly. \n", "num_cores_per_node = 4\n", "partition = \"big\"\n", "# for c5n.2xlarge wit HyperThreading, the number of cores is 4 - change this accordingly. 
\n", "#num_cores_per_node = 8\n", "#partition = \"q2\"\n", "\n", "\n", "job_name = f\"{simulation_name}-{cluster_instance_type}\"\n", "# fake account_name \n", "account_name = f\"dept-2d\" \n", "\n", "\n", "# turn on/off EFA support in the script\n", "use_efa=\"NO\"\n", "################# END change\n", "\n", "# prefix in S3\n", "my_prefix = \"athena/\"+job_name\n", "\n", "# template files for input and batch script\n", "input_file_ini = \"config/athinput_orszag_tang.ini\"\n", "batch_file_ini = \"config/batch_athena_sh.ini\"\n", "\n", "# actual input and batch script files\n", "input_file = \"athinput_orszag_tang.input\"\n", "batch_file = \"batch_athena.sh\"\n", " \n", "################## Begin ###################################\n", "# Mesh/Meshblock parameters\n", "# nx1,nx2,nx3 - number of zones in x,y,z\n", "# mbx1, mbx2, mbx3 - meshblock size \n", "# nx1/mbx1 X nx2/mbx2 X nx3/mbx3 = number of meshblocks - this should be the number of cores you are running the simulation on \n", "# e.g. mesh 100 X 100 X 100 with meshsize 50 X 50 X 50 will yield 2X2X2 = 8 blocks, run this on a cluster with 8 cores \n", "# \n", "\n", "#Mesh - actual domain of the problem \n", "# 512X512X512 cells with 64x64x64 meshblock - will have 8X8X8 = 512 meshblocks - if running on 32 cores/node, will need \n", "# 512/32=16 nodes\n", "# \n", "nx1=256\n", "nx2=256\n", "nx3=1\n", "\n", "#Meshblock - each meshblock size - not too big \n", "mbnx1=64\n", "mbnx2=64\n", "mbnx3=1\n", "\n", "\n", "num_of_threads = 1\n", "\n", "################# END ####################################\n", "\n", "#Make sure the mesh is divisible by meshblock size\n", "# e.g. num_blocks = (512/64)*(512/64)*(512/64) = 8 x 8 x 8 = 512\n", "num_blocks = (nx1/mbnx1)*(nx2/mbnx2)*(nx3/mbnx3)\n", "\n", "###\n", "# Batch file parameters\n", "# num_nodes should be less than or equal to the max number of nodes in your cluster\n", "# num_tasks_per_node should be less than or equal to the max number of nodes in your cluster \n", "# e.g. 
"num_nodes = int(num_blocks/num_cores_per_node)\n", "\n", "num_tasks_per_node = num_blocks/num_nodes/num_of_threads\n", "cpus_per_task = num_of_threads\n", "\n", "\n", "# This is where the program is installed on the cluster\n", "exe_path = \"/shared/athena-public-version/bin/athena\"\n", "# This is where the program is going to run on the cluster\n", "work_dir = '/shared/tmp/'+job_name\n", "ph = { '${nx1}': str(nx1),\n", "       '${nx2}': str(nx2),\n", "       '${nx3}': str(nx3),\n", "       '${mbnx1}': str(mbnx1),\n", "       '${mbnx2}': str(mbnx2),\n", "       '${mbnx3}': str(mbnx3),\n", "       '${num_of_threads}' : str(num_of_threads)}\n", "pcluster_helper.template_to_file(input_file_ini, 'build/'+input_file, ph)\n", "\n", "ph = {'${nodes}': str(num_nodes),\n", "      '${ntasks-per-node}': str(int(num_tasks_per_node)),\n", "      '${cpus-per-task}': str(cpus_per_task),\n", "      '${account}': account_name,\n", "      '${partition}': partition,\n", "      '${job-name}': job_name,\n", "      '${EXE_PATH}': exe_path,\n", "      '${WORK_DIR}': work_dir,\n", "      '${input-file}': input_file,\n", "      '${BUCKET_NAME}': pcluster_helper.my_bucket_name,\n", "      '${PREFIX}': my_prefix,\n", "      '${USE_EFA}': use_efa,\n", "      '${OUTPUT_FOLDER}': \"output/\",\n", "      '${NUM_OF_THREADS}' : str(num_of_threads)}\n", "pcluster_helper.template_to_file(batch_file_ini, 'build/'+batch_file, ph)\n", "\n", "# upload to S3 for use later\n", "pcluster_helper.upload_athena_files(input_file, batch_file, my_prefix)\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Submit the job using the REST API" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_script = \"#!/bin/bash\\n/shared/tmp/fetch_and_run.sh {} {} {} {} {}\".format(pcluster_helper.my_bucket_name, my_prefix, input_file, batch_file, job_name)\n", "\n", "slurm_job_submit_base=slurm_rest_base+'/job/submit'\n", "\n", "# in order to submit jobs through the Slurm REST API, the working directory (in this case /shared/tmp) must be writable by the user the job runs as (\"slurm\"), which is set up in the post_install script\n", "data = {'job':{ 'account': account_name, 'partition': partition, 'name': job_name, 'current_working_directory':'/shared/tmp/', 'environment': {\"PATH\": \"/bin:/usr/bin/:/usr/local/bin/:/opt/slurm/bin:/opt/amazon/openmpi/bin\",\"LD_LIBRARY_PATH\":\n", "\"/lib/:/lib64/:/usr/local/lib:/opt/slurm/lib:/opt/slurm/lib64\"}}, 'script':job_script}\n", "\n", "###\n", "# This job submission generates two jobs. The job_id returned in the response is for the bash (fetch_and_run) job itself; the sbatch job will be job_id+1, run subsequently.\n", "#\n", "resp_job_submit = pcluster_helper.post_response_as_json(slurm_job_submit_base, data=json.dumps(data))\n", "\n", "\n", "print(resp_job_submit)\n", "\n", "print(job_script)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### List recent jobs\n", "\n", "You will see one job starting, then followed by another job." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the list of all the jobs immediately after the previous step. This should return two running jobs.\n",
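"\n", "# Optional helper (an assumption, not part of pcluster_athena.py): poll a single job until it reaches a\n", "# terminal state, using the GET /job/{job_id} endpoint of the same v0.0.37 API. Call it manually if you\n", "# want to block until the job finishes, e.g. wait_for_job(resp_job_submit['job_id']).\n", "def wait_for_job(job_id, poll_seconds=30):\n", "    while True:\n", "        job = pcluster_helper.get_response_as_json(slurm_rest_base + '/job/' + str(job_id))\n", "        state = job['jobs'][0]['job_state']\n", "        print('job', job_id, state)\n", "        if state in ('COMPLETED', 'FAILED', 'CANCELLED', 'TIMEOUT'):\n", "            return state\n", "        time.sleep(poll_seconds)\n", "\n",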
\n", "slurm_jobs_base=slurm_rest_base+'/jobs'\n", "\n", "jobs = pcluster_helper.get_response_as_json(slurm_jobs_base)\n", "#print(jobs)\n", "jobs_headers = [ 'job_id', 'job_state', 'account', 'batch_host', 'nodes', 'cluster', 'partition', 'current_working_directory', 'command']\n", "\n", "# newer version of slurm \n", "pcluster_helper.print_table_from_json_array(jobs_headers, jobs['jobs'])\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A low resolution simulation will run about few minutes, plus the time for the cluster to spin up. Wait till the job finishes running then move to the next sections. \n", "\n", "\n", "Example of the output : \n", "\n", "|job_id |job_state |account | batch_host | nodes | cluster |partition |current_working_directory |command |\n", "| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n", "|2|COMPLETED|dept-2d|q3-dy-c5n2xlarge-1|1|mypc5c|q3|/shared/tmp/| |\n", "| 3| CONFIGURING |dept-2d|q3-dy-c5n2xlarge-1 |4 | mypc5c| q3 |/shared/tmp/orz-512x512-c5n.2xlarge|/shared/tmp/orz-512x512-c5n.2xlarge/batch_athena.sh |\n", "\n", "\n", "You can also use ```sinfo``` command on the headnode to monitor the status of your node allocations. \n", "\n", "\n", "\n", "### Compute node state\n", "\n", "AWS ParallelCluster with Slurm scheduler uses Slurm's power saving pluging (see https://docs.aws.amazon.com/parallelcluster/latest/ug/multiple-queue-mode-slurm-user-guide.html). The compute nodes go through a lifecycle with idle and alloc states\n", "- POWER_SAVING (~)\n", "- POWER_UP (#)\n", "- POWER_DOWN (%)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualize Athena++ Simulation Results\n", "Now we are going to use the python library comes with Athena++ to read and visualize the simulation results. Simulation data was saved in s3://\\/athena/$job_name/output folder. \n", "\n", "Import the hdf python code that came with Athena++ and copy the data to local file system. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import matplotlib.animation as animation\n", "from IPython.display import clear_output\n", "import h5py\n", "\n", "#Do this once. clone the athena++ source code , and the hdf5 python package we need is under vis/python folder\n", "\n", "if not os.path.isdir('athena-public-version'):\n", " !git clone https://github.com/PrincetonUniversity/athena-public-version\n", "else:\n", " print(\"Athena++ code already cloned, skip\")\n", " \n", "sys.path.insert(0, 'athena-public-version/vis/python')\n", "import athena_read" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "data_folder=job_name+'/output'\n", "output_folder = pcluster_helper.my_bucket_name+'/athena/'+data_folder\n", "\n", "if not os.path.isdir(job_name):\n", " !mkdir -p $job_name\n", "else:\n", " !rm -rf $job_name/*\n", "\n", "!aws s3 cp s3://$output_folder/ ./$data_folder/ --recursive\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Display the hst data\n", "History data shows the overs all parameter changes over time. 
"\n", "In OrszagTang simulations, the variables in the hst files are 'time', 'dt', 'mass', '1-mom', '2-mom', '3-mom', '1-KE', '2-KE', '3-KE', 'tot-E', '1-ME', '2-ME', '3-ME'.\n", "\n", "All the variables are available as keys of the dictionary returned by athena_read.hst(), as printed in the next cell." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "from matplotlib import pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "\n", "hst = athena_read.hst(data_folder+'/OrszagTang.hst')\n", "\n", "# cannot always rely on this because the hst and hdf files can have a different number of time steps. In this case, we have the same number of steps\n", "num_timesteps = len(hst['time'])\n", "\n", "print(hst.keys())\n", "\n", "plt.plot(hst['time'], hst['dt'])\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Reading HDF5 data files \n", "\n", "The HDF5 data files contain all variables inside all meshblocks. There is some merging and calculation work to be done before we can visualize the result. Fortunately, the Athena++ vis/python package takes care of the hard part. \n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's examine the content of the hdf files\n", "\n", "f = h5py.File(data_folder+'/OrszagTang.out2.00001.athdf', 'r')\n", "# variable lists \n", "print(f.keys())\n", "\n", "# inspect the 'prim' dataset\n", "print(f['prim'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Simulation result data \n", "\n", "The raw athdf keys are printed by the previous cell.\n", "\n", "After the athena_read.athdf() call, the result contains the following keys, which can be used as the field names:\n", "['Coordinates', 'DatasetNames', 'MaxLevel', 'MeshBlockSize', 'NumCycles', 'NumMeshBlocks', 'NumVariables', 'RootGridSize', 'RootGridX1', 'RootGridX2', 'RootGridX3', 'Time', 'VariableNames', 'x1f', 'x1v', 'x2f', 'x2v', 'x3f', 'x3v', 'rho', 'press', 'vel1', 'vel2', 'vel3', 'Bcc1', 'Bcc2', 'Bcc3']\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def process_athdf(filename, num_step):\n", "    print(\"Processing \", filename)\n", "    athdf = athena_read.athdf(filename)\n", "    return athdf\n", "\n", "# extract a list of fields and take a slice in one dimension; dimension can be 'x', 'y', 'z'\n", "def read_all_timestep(data_file_name_template, num_steps, field_names, slice_number, dimension):\n", "\n", "    if dimension not in ['x', 'y', 'z']:\n", "        print(\"dimension can only be 'x/y/z'\")\n", "        return\n", "\n", "    # would ideally process all time steps together and store them in memory; however, they are too big, so we trade time for memory\n",
"    result = {}\n", "    for f in field_names:\n", "        result[f] = list()\n", "\n", "    for i in range(num_steps):\n", "        fn = data_file_name_template.format(str(i).zfill(5))\n", "        athdf = process_athdf(fn, i)\n", "        for f in field_names:\n", "            if dimension == 'x':\n", "                result[f].append(athdf[f][slice_number,:,:])\n", "            elif dimension == 'y':\n", "                result[f].append(athdf[f][:, slice_number,:])\n", "            else:\n", "                result[f].append(athdf[f][:,:, slice_number])\n", "\n", "    return result\n", "\n", "def animate_slice(data):\n", "    plt.figure()\n", "    for i in range(len(data)):\n", "        plt.imshow(data[i])\n", "        plt.title('Frame %d' % i)\n", "        plt.show()\n", "        plt.pause(0.2)\n", "        clear_output(wait=True)\n", "\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "data_file_name_template = data_folder+'/OrszagTang.out2.{}.athdf'\n", "\n", "# this is time consuming, try to do it only once\n", "data = read_all_timestep(data_file_name_template, num_timesteps, ['press', 'rho'], 0, 'x')\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cycle through the time steps and look at pressure\n", "animate_slice(data['press'])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now look at density\n", "animate_slice(data['rho'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Don't forget to clean up\n", "\n", "1. Delete the ParallelCluster\n", "2. Delete the RDS database\n", "3. Delete the S3 bucket\n", "4. Delete the secrets used in this exercise\n", "\n", "Deleting the VPC is risky, so it is left for you to clean up manually if you created a new VPC. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this is used during development, to reload the module after a change in the module\n", "try:\n", "    del sys.modules['pcluster_athena']\n", "except KeyError:\n", "    # ignore if the module is not loaded\n", "    print('Module not loaded, ignore')\n", "from pcluster_athena import PClusterHelper\n", "\n", "importlib.reload(workshop)\n", "\n", "#from pcluster_athena import PClusterHelper\n", "# create the helper - you can rerun the rest of the notebook again with no harm. There are checks in place for existing resources. \n", "pcluster_helper = PClusterHelper(pcluster_name, config_name, post_install_script_prefix)\n", "\n", "!pcluster delete-cluster --cluster-name $pcluster_helper.pcluster_name --region $REGION\n", "\n", "keep_key = True\n", "pcluster_helper.cleanup_after(KeepRDS=True, KeepSSHKey=keep_key)\n", "\n", "if not keep_key:\n", "    !rm pcluster-athena-key.pem\n", "    !rm -rf ~/.ssh/pcluster-athena-key.pem" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }