.. _header-n795:

Using SageMaker Containers for SageMaker Training
=================================================

This document discusses in detail how to use SageMaker Containers for training.

.. contents::

.. _header-n860:

Building and training your own algorithm container using SageMaker Containers
-----------------------------------------------------------------------------

SageMaker Containers simplifies the process of bringing your own container (BYOC). In this scenario, the only required environment variable is ``SAGEMAKER_PROGRAM``, which contains the name of the entry point script located under the ``/opt/ml/code`` folder. Alternatively, a hyperparameter named ``sagemaker_program`` can be used.

The workflow to train a BYOC container is as follows:

.. figure:: https://www.websequencediagrams.com/cgi-bin/cdraw?lz=dGl0bGUgQllPQyB0cmFpbmluZyB3b3JrZmxvdwoKClNhZ2VNYWtlciBUABYIU2VydmljZS0-AA0FZXI6ADMGCgphY3RpdmF0ZQAlBmVyCgpub3RlIG92ADYIZXIKCi0gcmVhZCBlbnRyeSBwb2ludCBmcm9tIAplbnZpcm9ubWVudCB2YXIgClNBR0VNQUtFUl9QUk9HUkFNCgplbmQgbm90ZQoKCgB2BwCAfwdpbmdFbnYAgQMHaW5nX2VudigpAIEDEAAgBgCBBhEADwgAgRQHdXNlciBoeXBlcnBhcmFtZXRlcnMAgS8IaW5wdXQgZGF0YSBjb25maWd1cmF0aW9uAIFPCHJlc291cmNlAAgWUHl0aG9uIFNESyBzZXR0dGluZwBRCWhhcmR3YXJlIGluZm9ybQBQBgCBYBEAgWIGLQCCbBAAgWcLZGUAgU8lZXIKLSBjcmVhdGUgc2NyaXB0IGFyZ3MAgwYGaHBzCi0gd3JpdGUAhAsKZW52AIEXBSBhcwAHBXZhcnMAgwYLAIMHCUUAg1UFUG9pbnQ6IHJ1bigAg2YFAINlBSwgZW52X3ZhcnMsAGkFAIMVDAAsCyAAhC0LAEMLCgotIHByZXBhcmUAhDEOb3IAhTIJCi0gZXhlY3V0ABYNAIQtDACFAhE6IC0gcmVwb3J0IHN1Y2Nlc3Mgb3IgZmFpbHVyZQoKCgCBRQsAgmwMcnVuAIJgDQCBHA0K&s=default
   :alt: byoc training workflow

SageMaker invokes the CLI binary `train `__ when training starts. This binary invokes `trainer.train() `__, the function responsible for creating the training environment, executing the entry point, and reporting results.

**Training environment** creation is encapsulated by the `training_env() `__ function call, which returns a `TrainingEnv `__ object.
The ``TrainingEnv`` provides access to aspects of the training environment relevant to training jobs, including hyperparameters, system characteristics, filesystem locations, environment variables, and configuration settings. It is a read-only snapshot of the container environment during training and it doesn't contain any form of state.

Example of how a script can use ``TrainingEnv``:

.. code:: python

    import os

    import keras
    import numpy as np
    from keras.applications.resnet50 import ResNet50

    import sagemaker_containers

    env = sagemaker_containers.training_env()

    # get the path of the channel 'training' from the ``inputdataconfig.json`` file
    training_dir = env.channel_input_dirs['training']

    # get the hyperparameter 'training_data_file' from the ``hyperparameters.json`` file
    file_name = env.hyperparameters['training_data_file']

    # get the folder where the model should be saved
    model_dir = env.model_dir

    data = np.load(os.path.join(training_dir, file_name))

    x_train, y_train = data['features'], keras.utils.to_categorical(data['labels'])

    model = ResNet50(weights='imagenet')
    ...
    model.fit(x_train, y_train)

    # save the model at the end of training
    model.save(os.path.join(model_dir, 'saved_model'))

**Entry point execution** is encapsulated by `entry_point.run(uri, user_entry_point, args, env_vars) `__. It prepares and executes the user entry point, passing ``env_vars`` as environment variables and ``args`` as command arguments. If the entry point is:

- **A Python script:** executes the script as ``ENV_VARS python entry_point_name ARGS``
- **Any other script:** executes the command as ``ENV_VARS /bin/sh -c ./module_name ARGS``

Usage example:

.. code:: python

    import sagemaker_containers
    from sagemaker_containers.beta.framework import entry_point

    env = sagemaker_containers.training_env()
    # {'channel-input-dirs': {'training': '/opt/ml/input/training'}, 'model_dir': '/opt/ml/model', ...}

    # reading hyperparameters as a dictionary
    hyperparameters = env.hyperparameters
    # {'batch-size': 128, 'model_dir': '/opt/ml/model'}

    # reading hyperparameters as script arguments
    args = env.to_cmd_args()
    # ['--batch-size', '128', '--model_dir', '/opt/ml/model']

    # reading the training environment as env vars
    env_vars = env.to_env_vars()
    # {'SAGEMAKER_CHANNELS':'training',
    #  'SAGEMAKER_CHANNEL_TRAINING':'/opt/ml/input/training',
    #  'SAGEMAKER_MODEL_DIR':'/opt/ml/model', ...}

    # executes the user entry point named user_script.py as follows:
    #
    # SAGEMAKER_CHANNELS=training SAGEMAKER_CHANNEL_TRAINING=/opt/ml/input/training \
    # SAGEMAKER_MODEL_DIR=/opt/ml/model python user_script.py --batch-size 128 --model_dir /opt/ml/model
    #
    # the first argument is the location of the code, matching the
    # run(uri, user_entry_point, args, env_vars) signature described above
    entry_point.run(env.module_dir, 'user_script.py', args, env_vars)

If the entry point execution fails, ``trainer.train()`` will write the error message to ``/opt/ml/output/failure``. Otherwise, the entry point touches the success file under ``/opt/ml/success``.

.. _header-n814:

Training a Framework container
------------------------------

`TensorFlow `__, `MXNet `__, `PyTorch `__, `Chainer `__, and `Scikit-Learn `__ are **Framework Containers**. One difference between a **Framework Container** and a **BYOC** container is that, while the latter includes the entry point under ``/opt/ml/code``, the former doesn't include the user entry point and needs to download it from S3. The workflow is as follows:

.. figure:: https://www.websequencediagrams.com/cgi-bin/cdraw?lz=dGl0bGUgRnJhbWV3b3JrIENvbnRhaW5lciB0cmFpbmluZyB3b3JrZmxvdwoKClNhZ2VNYWtlciBQeXRob24gU0RLLT4ADApUACwIU2VydmljZTogRXN0aW1hdG9yKGVudHJ5cG9pbnQsIGt3YXJncyoqKS5maXQoY2hhbm5lbHMpCgoKCmFjdGl2YXRlIABYFApub3RlIG92ZXIACRYKLSBjcmVhdGVzIGEgY3VzdG9tZXIgUzMgYnVja2V0IGlmIG5lY2Vzc2FyeQotIGNvbXByZXNzIACBFwUgAIEYBSBpbiBzb3VyY2UudGFyLmd6CgplbmQgAHQFAIFoGTM6IHVwbG9hZHMALg4gdG8AfQoAfgYKCgCBJSAAgTkLZGRpdGlvbmFsIGh5cGVycGFyYW1ldGVycwotIHN0YXJ0IHRoZQCDIApqb2IKAIEZFgCDEQgAgxEHLQCDKQwAg0EKOgoKCmRlAIJkHgCDeQwAg18QLT4Ag3wFZXI6AIQ4BgCDQwsAEgcKCgoKAB0HACYHaW5nRW52ACoHaW5nX2VudigpACoQACAGAIIvDAAKDS0gcmVhZACCFRAgcHJvdmlkZWQgYnkAhSoHAIJaBgAoBXVzZXIAgkQTAEQFaW5wdXQgZGF0YSBjb25maWd1cmF0aW9uAGEIcmUAhB0GAAgWcACGEwkgc2V0dHRpbmcAUQloYXJkd2FyZSBpbmZvcm0AUAYAhFkJAIIWCACCDwYtAIJGCgCDCw0AgXYcZXIAhXsJIHNjcmlwdCBhcmdzIGZyb20gaHBzCi0gd3JpdACEKgtlbnYAgQgFIGFzAAcFdmFycwCBBgsAgyUJRQCGHwVQb2ludDogcnVuAIdEDWVudl92YXJzLABpBSkAhzwMAC0LIAoKClMzLQA-DkRvd24AhicUAIEtBQCGIRwAgQYLCgotIHByZXBhcmUAhz8NZm8AiT4KCi0gZXhlY3V0ABYNAIJwDACCOhE6IC0gcmVwb3J0IHN1Y2Nlc3Mgb3IgZmFpbHVyZQoKCgCCCAsAgxQYAIEYDQCDNBFlcgoK&s=default
   :alt: framework containers training workflow

The subsections below detail the integration between the SageMaker Python SDK and SageMaker Containers, and how to create a framework container.

.. _header-n819:

Integration between the SageMaker Python SDK and SageMaker Containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the SageMaker Python SDK is used to create a training job with a framework container, it passes special hyperparameters to the training job, which are parsed by SageMaker Containers and the framework container. For example:

.. code:: python

    from sagemaker.tensorflow import TensorFlow

    model_dir = 's3://SAGEMAKER-BUCKET/hvd-job-377/model'

    mpi_distribution = {
        'mpi': {
            'enabled': True,
            'custom_mpi_options': '-x HOROVOD_HIERARCHICAL_ALLREDUCE=1',
            'processes_per_host': 8}}

    estimator = TensorFlow(entry_point='train_horovod_imagenet.sh',
                           model_dir=model_dir,
                           hyperparameters={'lr': 0.3},
                           distributions=mpi_distribution,
                           ...)

When a training job is created using the estimator above, i.e. when ``estimator.fit()`` is called, the Python SDK creates additional hyperparameters and invokes the training job as follows:

.. code:: python

    import boto3

    job_hyperparameters = {
        # user provided hyperparameters
        'lr': '0.3',

        # hyperparameters created by the Python SDK and used by SageMaker Containers
        'sagemaker_job_name': 'JOB_NAME',
        'sagemaker_program': 'train_horovod_imagenet.sh',
        'sagemaker_region': 'us-west-2',
        'sagemaker_submit_directory': 's3://SAGEMAKER-BUCKET/JOB_NAME/source.tar.gz',
        'sagemaker_container_log_level': '20',
        'sagemaker_mpi_enabled': 'true',
        'sagemaker_mpi_num_of_processes_per_host': '8',

        # hyperparameters created by the Python SDK and used by the TF container
        'model_dir': 's3://SAGEMAKER-BUCKET/hvd-job-377/model'
    }

    boto3.client('sagemaker').create_training_job(HyperParameters=job_hyperparameters, ...)

As the example shows, in addition to user-provided hyperparameters, the SageMaker Python SDK includes hyperparameters that are used by SageMaker Containers and/or the framework container. The most important SageMaker hyperparameters for training are:

- ``sagemaker_program``: name of the user-provided entry point. It is **mandatory** unless the environment variable ``SAGEMAKER_PROGRAM`` is provided.
- ``sagemaker_submit_directory``: local or S3 URI location of the ``source.tar.gz`` file containing the entry point code. It is **mandatory** unless the code is already located under the ``/opt/ml/code`` folder.

The complete list of hyperparameters is available `here `__.

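The split between user-provided and SDK-generated hyperparameters described above can be illustrated with a few lines of plain Python. This is only a sketch: ``split_hyperparameters`` is a hypothetical helper, not part of SageMaker Containers, and the prefix-based rule is an assumption made for illustration.

```python
def split_hyperparameters(hyperparameters, prefix='sagemaker_'):
    """Hypothetical helper: split a hyperparameter dict by key prefix.

    Returns a tuple (sagemaker_hps, other_hps).
    """
    sagemaker_hps = {k: v for k, v in hyperparameters.items() if k.startswith(prefix)}
    other_hps = {k: v for k, v in hyperparameters.items() if not k.startswith(prefix)}
    return sagemaker_hps, other_hps


# a subset of the hyperparameters from the example above
job_hyperparameters = {
    'lr': '0.3',
    'sagemaker_program': 'train_horovod_imagenet.sh',
    'sagemaker_submit_directory': 's3://SAGEMAKER-BUCKET/JOB_NAME/source.tar.gz',
    'sagemaker_mpi_enabled': 'true',
    'model_dir': 's3://SAGEMAKER-BUCKET/hvd-job-377/model',
}

sagemaker_hps, other_hps = split_hyperparameters(job_hyperparameters)
```

Note that a prefix rule alone cannot tell apart every hyperparameter added by the Python SDK: ``model_dir`` is created by the SDK for the TF container but carries no ``sagemaker_`` prefix, so in this sketch it ends up next to the user-provided ``lr``.
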
.. _header-n831:

How to create a framework container for training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A framework container is composed of a Dockerfile and framework-specific logic. Let's see the MXNet container as an example:

.. _header-n833:

Creating the Dockerfile
^^^^^^^^^^^^^^^^^^^^^^^

**Dockerfile**

.. code:: docker

    FROM mxnet/python

    # install SageMaker Containers and SageMaker MXNet Container
    RUN pip install sagemaker-containers sagemaker_mxnet_container

    # set sagemaker_mxnet_container.training:train as the framework entry point
    ENV SAGEMAKER_TRAINING_MODULE sagemaker_mxnet_container.training:train

In the example above, MXNet and Python libraries are already installed in the base container. The framework container only needs to install SageMaker Containers and the SageMaker MXNet container package. The environment variable ``SAGEMAKER_TRAINING_MODULE`` determines that the function ``train`` in the module ``training`` of the container package is invoked when the container starts.

**The training package**

.. code:: python

    from sagemaker_containers.beta import framework

    def train(env):
        env = framework.training_env()

        # name of the user entry point from sagemaker hyperparameter
        user_entry_point = env.module_name
        # local or S3 URI location of the source.tar.gz file
        module_dir = env.module_dir

        framework.entry_point.run(module_dir, user_entry_point,
                                  env.to_cmd_args(), env.to_env_vars())

The code above covers everything necessary for single-instance training using MXNet. The following example includes the framework-specific logic required for distributed training.

.. code:: python

    from sagemaker_containers.beta import framework

    def train(env):
        env = framework.training_env()

        ps_port = '8000'

        # starts the MXNet scheduler only in the first instance
        if env.current_host == 'algo-1':
            _run_mxnet_process('scheduler', env.hosts, ps_port)

        # starts MXNet parameter server in all instances
        _run_mxnet_process('server', env.hosts, ps_port)

        framework.entry_point.run(env.module_dir, env.module_name,
                                  env.to_cmd_args(), env.to_env_vars())

The implementation of ``_run_mxnet_process`` can be found `here `__. The example above starts the MXNet **scheduler** in the first instance and starts the MXNet **server** in all instances.
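The role assignment described above can be sketched as a pure function, which makes it easy to check without starting any process. This is an illustration only: ``mxnet_roles`` is a hypothetical helper that is not part of SageMaker Containers or the MXNet container, the host names ``algo-2`` and ``algo-3`` are made up for the example, and the sketch assumes the first instance is the host named ``algo-1``, as in the code above.

```python
def mxnet_roles(current_host, first_host='algo-1'):
    """Hypothetical helper: list the MXNet processes to start on a host.

    The scheduler runs only on the first host; a parameter server
    runs on every host.
    """
    roles = []
    if current_host == first_host:
        roles.append('scheduler')
    roles.append('server')
    return roles


# assumed host names for a three-instance training job
hosts = ['algo-1', 'algo-2', 'algo-3']
roles_by_host = {host: mxnet_roles(host) for host in hosts}
```

With these inputs, only ``algo-1`` is assigned both roles, while the other hosts are assigned the server role alone.
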