{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
<span style=\"font-size: 200%; font-weight: bold\">Title</span>
\n", "\n", "This skeleton notebook includes reference structure and stanzas, and tips to make notebook as\n", "ergonomic (for both (co-)authors and readers) and camera-ready as possible.\n", "\n", "**NOTE:**\n", "- Best viewed using Jupyter Lab.\n", "- The title is a styled sentence rather than `h1`, to prevent it being showed and numbered in TOC.\n", "\n", "**NOTE:** this skeleton notebook is meant for reading. To run it,\n", "please install additional dependencies imported in the second next cell which starts with line\n", "`# Dependencies required`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina'\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "# Make sure my_nb_path is imported first (and when isort is used, it needs to be told).\n", "import my_nb_path # isort: skip\n", "from my_nb_color import print, pprint, inspect" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Dependencies required\n", "import ndpretty\n", "import numpy as np\n", "import pandas as pd\n", "import sagemaker as sm\n", "from IPython.display import Markdown\n", "from loguru import logger\n", "from smallmatter.ds import mask_df # See: https://github.com/aws-samples/smallmatter-package/\n", "\n", "# A few standard SageMaker's stanzas. Use type annotation to be verbose.\n", "role: str = sm.get_execution_role()\n", "sess = sm.Session()\n", "region: str = sess.boto_session.region_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Global setup\n", "\n", "This section contains Python variables that should be personalized such as:\n", "- the name of Amazon S3 bucket and/or prefix may vary from one project member to another.\n", "- the filename of the dataset to run.\n", "\n", "We also show a pattern to automatically synchronize the Python variable to environment variables.\n", "The idea is to centralized all changes to only this section, then you can safely run the remaining\n", "cells without having to worry about outdated hardcoded values in the Python, `!`, and `%%` codes.\n", "\n", "
<details>\n", "<summary>Note on heading</summary>\n", "\n", "> This section starts with an `h1` heading. Thus, it will appear in the TOC as \"*1. Global setup*\".\n", "</details>\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "####################################################################################################\n", "# Change me\n", "####################################################################################################\n", "bucket_name = \"my-bucket-name\"\n", "prefix_name = \"some/prefix\"\n", "####################################################################################################\n", "\n", "\n", "####################################################################################################\n", "# Do not change the next lines, as they're derived and will be recomputed automatically.\n", "####################################################################################################\n", "s3_prefix = f\"s3://{bucket_name}/{prefix_name}\".rstrip(\"/\")\n", "\n", "# Synchronize Python variable and environment variable.\n", "%set_env S3_PREFIX=$s3_prefix\n", "%env S3_PREFIX_CELL_SCOPE=$s3_prefix\n", "\n", "# Demonstrate the difference between %env and %set_env.\n", "!echo $S3_PREFIX_CELL_SCOPE # Should print s3://my-bucket-name/some/prefix\n", "!echo $S3_PREFIX # Should print s3://my-bucket-name/some/prefix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Demonstrate the difference between %env and %set_env\n", "!echo $S3_PREFIX # Should print s3://my-bucket-name/some/prefix\n", "!echo $S3_PREFIX_CELL_SCOPE # Should print an empty string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next cell demonstrates the benefit of having environment variables synchronized to Python variables.\n", "By avoiding hardcoding, you can parameterized your `!` or `%%` commands, to avoid changing hardcoded\n", "values scatterred throughout this notebook.\n", "\n", "*You can also use a raw cell instead of a markdown cell, however do note raw cells may not be\n", "rendered correctly outside of Jupyter Lab.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```bash\n", "####################################################################################################\n", "# Demonstrate the benefit of avoiding hardcoding as much as possible.\n", "####################################################################################################\n", "\n", "# Whenever `bucket_name` and `prefix_name` are updated (i.e., variables are changed and their cell\n", "# are executed), the next CLI is guaranteed to always list the updated Amazon S3 prefix.\n", "!aws s3 ls --recursive $S3_PREFIX/\n", "\n", "\n", "# SEGWAY: since we're talking about aws-cli, here're a few tricks to read Amazon S3 files without\n", "# having to first download and save those files to your local filesystem.\n", "\n", "# Show the first few rows of a file in Amazon S3. 
NOTE: you can safely ignore the broken pipe error.\n", "!aws s3 cp $S3_PREFIX/haha.txt - | head \n", "\n", "# List files in an archive\n", "!aws s3 cp $S3_PREFIX/model.tar.gz - | tar -tzvf -\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Improved output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Colored outputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Colored: \u001b[1m{\u001b[0m\u001b[32m'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'\u001b[0m, \u001b[32m'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'\u001b[0m\u001b[1m}\u001b[0m\n", "Colored and wrapped:\n", "\u001b[1m{\u001b[0m\n", " \u001b[32m'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\u001b[0m\n", "\u001b[32mAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\u001b[0m\n", "\u001b[32mAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'\u001b[0m,\n", " \u001b[32m'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\u001b[0m\n", "\u001b[32mBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\u001b[0m\n", "\u001b[32mBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'\u001b[0m\n", "\u001b[1m}\u001b[0m\n", "\n", "\u001b[1m{\u001b[0m\n", " \u001b[32m'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'\u001b[0m,\n", " \u001b[32m'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32m2022-01-22 17:23:03.529\u001b[0m | \u001b[34m\u001b[1mDEBUG \u001b[0m | \u001b[36m__main__\u001b[0m:\u001b[36m\u001b[0m:\u001b[36m7\u001b[0m - \u001b[34m\u001b[1mHello World!\u001b[0m\n", "\u001b[32m2022-01-22 17:23:03.530\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36m__main__\u001b[0m:\u001b[36m\u001b[0m:\u001b[36m7\u001b[0m - \u001b[1mHello World!\u001b[0m\n", "\u001b[32m2022-01-22 17:23:03.531\u001b[0m | \u001b[32m\u001b[1mSUCCESS \u001b[0m | \u001b[36m__main__\u001b[0m:\u001b[36m\u001b[0m:\u001b[36m7\u001b[0m - \u001b[32m\u001b[1mHello World!\u001b[0m\n", "\u001b[32m2022-01-22 17:23:03.532\u001b[0m | \u001b[31m\u001b[1mERROR \u001b[0m | \u001b[36m__main__\u001b[0m:\u001b[36m\u001b[0m:\u001b[36m7\u001b[0m - \u001b[31m\u001b[1mHello World!\u001b[0m\n" ] } ], "source": [ "d = {\"A\" * 200, \"B\" * 200}\n", "print(\"Colored:\", d)\n", "pprint(\"Colored and wrapped:\", d)\n", "display(d)\n", "\n", "for f in (logger.debug, logger.info, logger.success, logger.error):\n", " f(\"Hello World!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataframes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "### Plain dataframe\n", "**NOTE:** this also appears in TOC as \"*2.2.1. 
Plain dataframe*\"" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab
014
125
236
\n", "
" ], "text/plain": [ " a b\n", "0 1 4\n", "1 2 5\n", "2 3 6" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/markdown": [ "### Masked dataframe\n", "Sometime, we would like to version the output of this cell into the git repo, to help readers to\n", "quickly see the shape of a dataframe.\n", "\n", "However, when the dataframe contains sensitive values, care must be taken to\n", "**NEVER** version these values to git.\n", "Otherwise, as you all know, once checked into the git history, it can be tedious and challenging to\n", "undo the versioning.\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
useridpca_apca_b
0xxx0.1-0.30
1xxx0.20.01
2xxx0.30.70
\n", "
" ], "text/plain": [ " userid pca_a pca_b\n", "0 xxx 0.1 -0.30\n", "1 xxx 0.2 0.01\n", "2 xxx 0.3 0.70" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def mask_userid(df: pd.DataFrame) -> pd.DataFrame:\n", " return mask_df(df, cols=[\"userid\"])\n", "\n", "\n", "df_a = pd.DataFrame(\n", " {\n", " \"a\": [1, 2, 3],\n", " \"b\": [4, 5, 6],\n", " }\n", ")\n", "df_b = pd.DataFrame(\n", " {\n", " \"userid\": [1000, 2000, 3000], # Illustration only. Usually read from somewhere.\n", " \"pca_a\": [0.1, 0.2, 0.3],\n", " \"pca_b\": [-0.3, 0.01, 0.7],\n", " }\n", ")\n", "\n", "display(\n", " Markdown(\n", " '### Plain dataframe\\n**NOTE:** this also appears in TOC as \"*2.2.1. Plain dataframe*\"'\n", " ),\n", " df_a,\n", " Markdown(\n", " \"\"\"### Masked dataframe\n", "Sometime, we would like to version the output of this cell into the git repo, to help readers to\n", "quickly see the shape of a dataframe.\n", "\n", "However, when the dataframe contains sensitive values, care must be taken to\n", "**NEVER** version these values to git.\n", "Otherwise, as you all know, once checked into the git history, it can be tedious and challenging to\n", "undo the versioning.\n", "\"\"\"\n", " ),\n", " mask_userid(df_b),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pretty ndarray display\n", "\n", "Example: display a 2D tensor." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Affect globally\n", "ndpretty.default()\n", "np.random.rand(9, 9)\n", "\n", "# NOTE: without ndpretty.default(), use this form:\n", "# ndpretty.ndarray_html(np.random.rand(3, 4))\n", "\n", "# NOTE: the rendered output won't persist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "\n", "When this notebook should be versioned without output, do a *Clear All Outputs*.\n", "\n", "When there're output to be version (like what this skeleton notebook does), consider to remove the\n", "cell counts.\n", "\n", "**Advance git tips:** selectively choose which hunk (i.e., portion) of unstaged change to stage\n", "using `git add -i filename.ipynb`. For instance, you may edit this notebook on another machine with\n", "a different Python version or environment name, and these minutae changes need not be committed.\n", "Please refer to git documentation for more details. TLDR version: choose `5: patch`, then decide\n", "what to do with each hunk. Best combined with `nbdiff` for a more intuitive diff view.\n", "\n", "
<details>\n", "<summary>Footnote</summary>\n", "\n", "> This skeleton notebook was run through the\n", "> [clr-nb-xcnt.sh](https://github.com/verdimrc/pyutil/blob/master/bin/clr-nb-xcnt.sh) bash script\n", "> to clear its cell counts.\n", ">\n", "> DISCLAIMER: the script is provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS\n", "> OF ANY KIND, either express or implied, including, without limitation, any warranties or\n", "> conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You\n", "> are solely responsible for determining the appropriateness of using or redistributing the Work and\n", "> assume any risks associated with Your exercise of permissions under this Apache License 2.0.\n", "</details>\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Environment (virtualenv_ds-p310)", "language": "python", "name": "virtualenv_ds-p310" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "toc-autonumbering": true, "toc-showcode": false, "toc-showmarkdowntxt": false, "toc-showtags": false }, "nbformat": 4, "nbformat_minor": 4 }