{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distributed RL Training Using Ray Framework with Cart Pole v1 Environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this SageMaker Notebook we will walkthrough an example using the Cart Pole v1 Reinforcement Learning (RL) use case. This is a classic RL use case where the environment is created within the [OpenAI gym toolkit](https://github.com/openai/gym). You can read more about the Cart Pole use case and refer to the initial research paper by Barto, Sutton, and Anderson in the following [Gym Documentation](https://www.gymlibrary.dev/environments/classic_control/cart_pole/). \n", "\n", "The Ray framework is able to load the Cart Pole v1 environment using available algorithms from the Ray Reinforcement Learning library known as [RLlib](https://docs.ray.io/en/latest/rllib/index.html). You can find available RLlib algorithms [here](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html). These algorithms have the ability to load the Cart Pole v1 environment with default configurations. This gives the user the ability to adapt original RL use cases to state of the art RL algorithms. \n", "\n", "In our example, we use the [PPO](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html#ppo) policy gradient based algorithm within Ray's RLlib library to load the Cart Pole v1 environment and run 10 iterations to train a RL model within a SageMaker Training Job. Later we adapt this example to a SageMaker Hyperparameter Optimization (HPO) Tuning job to demonstrate tuning hyperparameters given a RL use case. \n", "\n", "Furthermore, to enable you to use Ray RLlib at scale we show how to extend a SageMaker Deep Learning Container (DLC), with the TensorFlow framework, with Ray to execute an Amazon SageMaker Training Job using a cluster of EC2 instances to distribute wworker tasks among your CPUs and reduce training time (both for CPU or GPU instance types). \n", "\n", "The key goals from this demonstration are:\n", "* Demonstrate how to extend a SageMaker DLC with Ray to power your RL use case and to use for both SM Training and HPO jobs\n", "* Provide a *Getting Started* example leveraging the RLlib library which you can adapt to your own use case\n", "* Show how to use Ray's tune class to output a report of key RL metrics among each training iteration in CloudWatch Logs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Important Scripts\n", "\n", "You will find a *src* path within this directory with 3 files:\n", "\n", "1. __requirements.txt__ \n", " * This is the requirements file that contains the python packages required at run time to extend the SageMaker DLC. You will find pacakges such as Ray and tensorflow dependencies required for Ray. \n", "2. __sagemaker_ray_helper.py__ \n", " * This is a helper script that will help SageMaker initiate Ray amongst the EC2 instance(s). This will enable the Training Job to distribute the Ray tasks across the workers of the Ray cluster. You will call classes from this script in your entrypoint script. \n", "3. __train_cart_pole.py__ \n", " * This is the entrypoint script for the RL training job. It will call __sagemaker_ray_helper.py__ to initiate the Ray cluster, import Ray python classes, load the Cart Pole environment, kick-off training iterations, and use Ray's tune class to report RL metrics each training iteration. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.image_uris import get_training_image_uri\n", "from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Review image uris to choose from for DLC__\n", "Here we review an image uri to use. This will help us select the python version, tensorflow framework, and instance type later when we use the TensorFlow SageMaker Estimator class\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image_uri = get_training_image_uri(framework=\"tensorflow\", \n", " region=\"us-east-1\",\n", " py_version=\"py39\",\n", " framework_version=\"2.8\",\n", " instance_type=\"ml.m5.4xlarge\"\n", " )\n", "\n", "print(f'image uri: {image_uri}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train model with TensorFlow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change bucket name below if needed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import uuid\n", "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "\n", "sess = sagemaker.session.Session()\n", "role = sagemaker.get_execution_role()\n", "bucket = sagemaker.session.Session().default_bucket() # change bucket name here if needed\n", "key_prefix = f\"{uuid.getnode()}/distributed_rl\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below are metric definitions which will be reported in the CloudWatch Logs during the SageMaker Training Job. Ray will capture these metrics for each training iteration. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metric_definitions = [\n", " {'Name': 'episode_reward_mean', 'Regex': 'episode_reward_max: ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},\n", " {'Name': 'episode_reward_max', 'Regex': 'episode_reward_mean: ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'}, \n", " {'Name': 'episode_reward_min', 'Regex': 'episode_reward_min: ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},\n", " {'Name': 'episodes_total', 'Regex': 'episodes_total: ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'}, \n", " {'Name': 'training_iteration', 'Regex': 'training_iteration: ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},\n", " {'Name': 'timesteps_total', 'Regex': 'timesteps_total: ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'}\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our example, we will setup a homogeneous cluster of 5 m5.4xlarge instances. __You may change this given your Amazon SageMaker Service Quota limit__. \n", "\n", "Since there are 16 vCPUs via m5.4xlarge we will specify the number of workers to be (16 * *number of instances*) - 1. We subtract one due to saving a CPU for overhead purposes. \n", "\n", "The required hyperparameter arguments are:\n", "* *num-workers* - Ray number of workers configuration (see sentence above)\n", "* *framework* - We use tensorflow so we specify \"tf\". If you change this to \"torch\", you will need to change the DLC (the TensforFlow Estimator framework will not work below) and you will need to modify the __train_cart_pole.py__ script to work with PyTorch along with your __requirement.txt__ file. \n", "* *train-iterations* - We use 10 as a default, increasing this number will increase the length of the training job.\n", "\n", "We choose several training hyperparameters for the PPO algorithm such as LR, gamma, kl coefficient, and number of SGD iterations. There are many more and you can refer to the PPO algorithm link earlier to review additional training parameters. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note* - In our example we use a CPU instance type. You may use a GPU (e.g. ml.p3.2xlarge) instance given your use case or for testing. If you do use a GPU instance, be cognizant of your *num-workers* parameter and please refer to the [RLlib Scaling guide](https://docs.ray.io/en/latest/rllib/rllib-training.html#rllib-scaling-guide) for best recommendationsto configure *num_workers* and *num_gpus_per_worker*. You may need to modify the configurations in the __train_cart_pole.py__ file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# these variables are configured for a CPU instance types\n", "number_of_instances = 5\n", "instance_type = \"ml.m5.4xlarge\"\n", "number_of_cpus = 16 # 16 vCpus per m5.4xlarge" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Training with TensorFlow\n", "tb_logging_path = f\"s3://{bucket}/{key_prefix}/tb_logs/tf\"\n", "tf_estimator = TensorFlow(\n", " source_dir = \"src\",\n", " entry_point=\"train_cart_pole.py\",\n", " role=role,\n", " instance_count=number_of_instances,\n", " metric_definitions=metric_definitions,\n", " hyperparameters={\"num-workers\":f\"{(number_of_cpus * number_of_instances)-1}\", \n", " \"framework\":\"tf\",\n", " \"train-iterations\": \"10\",\n", " \"lr\": \".001\",\n", " \"gamma\": \"0.99\",\n", " \"kl_coeff\": \"1.0\",\n", " \"num_sgd_iter\": \"20\"\n", " },\n", " instance_type=instance_type, # try with m5.4xlarge\n", " framework_version=\"2.8\",\n", " py_version=\"py39\",\n", " checkpoint_s3_uri=tb_logging_path,\n", " keep_alive_period_in_seconds=1800\n", ")\n", "\n", "tf_estimator.fit(wait=True) # change wait=False if you do not want to see the logs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When Executing the Job, notice the format of the output for each iteration. This is based from Ray's tune.report class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HPO Job\n", "\n", "Now we will setup a SageMaker HPO job to optimize the hyperparameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define arbitrary ranges\n", "hp_ranges = {\n", " \"lr\": ContinuousParameter(0.001, 0.01),\n", " \"gamma\": ContinuousParameter(0.8, 0.99),\n", " \"kl_coeff\": ContinuousParameter(0.3, 1.0),\n", " \"num_sgd_iter\": IntegerParameter(10,50)\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we create a SageMaker Hyperparameter Tuning Job. \n", "\n", "You may change the max number of jobs and max parallel jobs given your service quota limits. For this example, we choose max training jobs of 8 and max parrallel jobs of 2. Hence, you must be able to run 10 m5.4xlarges at 1 time. \n", "\n", "For the Cart Pole problem, our goal is to maximize the episode reward mean. 500 is the maximum value of reward. So if you maximize at this value, then you have reached the optimal reward. This is expected behaviour " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner = HyperparameterTuner(\n", " estimator=tf_estimator,\n", " objective_metric_name='episode_reward_mean',\n", " objective_type='Maximize',\n", " metric_definitions=metric_definitions,\n", " hyperparameter_ranges=hp_ranges,\n", " max_jobs=8,\n", " max_parallel_jobs=2,\n", " base_tuning_job_name='byoc-cart-pole'\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner.fit(wait=False) # To reduce logs, we recommend setting this to True and reviewing logs in AWS console" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }