## Experiment configuration file

Let's discuss the structure of the experiment config using `experiments/blog-benchmarks-all.yml` as an example.

### Main experiment settings

The experiment config begins by setting a couple of essential experiment parameters:

- *name* -- defines the name of the experiment under which it will be tracked in SageMaker Experiments. The name must be unique within an account.
- *description* -- a string that describes the experiment.
- *role* -- the name of the AWS IAM role that will be used by Amazon SageMaker training jobs to, for example, access training data and model artifacts.
- *region* -- the name of the AWS region for experiment jobs and other artifacts.
- *bucket* -- the name of the S3 bucket in which to store experiment artifacts. Set to `null` to use a default bucket name (recommended).
- *output_prefix* -- an S3 bucket prefix under which to store experiment artifacts.
- *parallelism* -- defines how many SageMaker training jobs will run concurrently at any point in time. Set this accordingly to stay within the [SageMaker Training quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) (note that you can also [request a SageMaker quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) as needed). Set to `0` to disable concurrent execution and run the experiment sequentially (e.g., for debugging).
- *repeat* -- specifies how many times to repeat each trial.
- *disable_profiler* -- specifies whether [SageMaker Debugger monitoring and profiling](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html) is enabled or disabled.
- *order* -- can be either `default`, `interleave`, or `random`. Specifies the order in which to run the (`repeat`ed) trials (e.g., to reduce resource congestion).
- *clean* -- set to `True` to automatically delete all previous trials and start this experiment from a clean slate. Set to `False` to append new trials to an existing experiment.

Excerpt from `blog-benchmarks-all.yml`:

```
name: blog-benchmarks-all
description: 'Benchmarks -- Blog: Choose the best data source for your Amazon SageMaker training job'
role: SageMakerRoleBenchmark
region: us-east-1
bucket: null
output_prefix: experiments
parallelism: 6
repeat: 3
disable_profiler: True
order: interleave
clean: True
```

### Dataset definitions

Datasets are defined under the `datasets` first-level key of the experiment config file, which can either be a `*.yml` file with dataset definitions (parsed upon experiment initialization; see the sketch below), or a dictionary where every key is the *name* of a dataset and the value is another dictionary with *dataset properties*.

The *dataset properties* indicate the *type* of the dataset and where to find it. If the dataset is not found at the specified S3 bucket/prefix location, and its *type* is either `synthetic` or `caltech` (you can also implement your own new dataset type), then the dataset will either be automatically downloaded (in the case of `caltech`) or synthesized locally (in the case of `synthetic`), and then prepared according to the other *dataset properties* before being uploaded to the specified S3 bucket/prefix. The benchmark experiment will start automatically afterwards. That is, building a dataset is normally required only once.
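As noted above, instead of an inline dictionary the `datasets` key can point to a standalone `*.yml` file with the dataset definitions. A minimal sketch, assuming the key takes a relative path (the file name here is hypothetical):

```
# Hypothetical path -- the referenced file would contain the same
# name-to-properties mapping as an inline `datasets` dictionary
datasets: datasets/my-datasets.yml
```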
Currently supported dataset types (specified by the `type` parameter in the *dataset properties*) are `caltech`, `synthetic`, and `s3prefix`. The `s3prefix` type allows you to use any custom dataset that has already been stored on S3 -- in this case you need to provide the S3 `bucket` name, as well as the S3 `prefix` where the dataset is stored. For example, the snippet below defines two datasets named `MyTrainDatasetSplit` and `MyTestDatasetSplit` that are stored under `s3://sagemaker-benchmark-us-east-1-24271126XXXX/datasets/MyTrainDatasetSplit` and `s3://sagemaker-benchmark-us-east-1-24271126XXXX/datasets/MyTestDatasetSplit`, respectively.

```
datasets:
  MyTrainDatasetSplit:
    type: s3prefix
    bucket: sagemaker-benchmark-us-east-1-24271126XXXX
    prefix: 'datasets'
  MyTestDatasetSplit:
    type: s3prefix
    bucket: sagemaker-benchmark-us-east-1-24271126XXXX
    prefix: 'datasets'
```

### Trial definitions

A *base trial* specifies parameters that by default apply to all trials, *unless overridden or extended in a trial definition later*. In other words, the specification of *each single trial* inherits from the *base trial*, but each single trial can *override* or *add any extra parameters* to the settings specified in the base trial. For example, the excerpt from `blog-benchmarks-all.yml` below defines 3 trials that are identical in all parameters except two (`dataset` and `input_mode`). The rest of the parameters are inherited from the *base trial* and are thus identical.

```
base_trial:
  script: script-model.py
  source_dir: scripts
  framework: tf
  framework_version: '2.4.1'
  py_version: py37
  volume_size: 500
  instance_type: ml.p3.2xlarge
  instance_count: 1
  max_run: 36000
  inputs:
    train:
      dataset:
      input_mode:
  hyperparameters:
    key1: value1
    key2: value2
  (...)

trials:
  - inputs:
      train:
        dataset: Caltech-jpg-1x
        input_mode: file
  - inputs:
      train:
        dataset: Caltech-jpg-1x
        input_mode: fsx
  - inputs:
      train:
        dataset: Caltech-jpg-1x
        input_mode: ffm
(...)
```

The base trial in this case defines several parameters for the trials:

- *script* -- the name of the local training script that will be executed as the entry point of the SageMaker training job. If `source_dir` is specified, then `script` must point to a file located at the root of `source_dir`.
- *source_dir* -- the path to a directory with the entry point script and any other dependencies.
- *framework* -- can be either `tf` for TensorFlow, `pt` for PyTorch, or `mx` for MXNet.
- *framework_version* -- the version of the framework you want to use for executing your model training code.
- *py_version* -- the Python version you want to use for executing your model training code.
- *volume_size* -- size in GB of the EBS volume to use for storing input data during training. Must be large enough to store the training data if File mode is used.
- *instance_type* -- type of EC2 instance to use for training, for example, `ml.p3.2xlarge`.
- *instance_count* -- number of Amazon EC2 instances to use for training.
- *max_run* -- timeout in seconds for training (default: 24 * 60 * 60). After this amount of time Amazon SageMaker terminates the job regardless of its current status.
- *hyperparameters* -- a dictionary with any custom hyperparameters to pass to the training script.
- *inputs* -- a dictionary where each key is the name of an input channel and the value is a dictionary with settings for that channel. Accepted native SageMaker input modes are `file`, `pipe`, `ffm`, and `fsx`; a two-channel example is shown at the end of this section.
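Because each trial inherits from the base trial, any of the parameters above can also be overridden or extended per trial, not only the input settings. A hedged sketch (values are illustrative; whether nested dictionaries are deep-merged or replaced depends on the config loader, and this sketch assumes a deep merge):

```
trials:
  - instance_type: ml.p3.8xlarge   # illustrative override of the base trial's ml.p3.2xlarge
    hyperparameters:
      key3: value3                 # illustrative extra hyperparameter
    inputs:
      train:
        dataset: Caltech-jpg-1x
        input_mode: ffm
```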
As an example of channel configuration, the snippet below defines two input channels (named `train` and `test`) that read `MyTrainDatasetSplit` and `MyTestDatasetSplit` (defined earlier under `datasets`), respectively, using Fast File Mode (`ffm`) in both cases:

```
inputs:
  train:
    dataset: MyTrainDatasetSplit
    input_mode: ffm
  test:
    dataset: MyTestDatasetSplit
    input_mode: ffm
```
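Since the input mode is a per-channel setting, different channels can in principle use different modes. A hedged variation of the snippet above, assuming mixed modes across channels are supported by your setup:

```
inputs:
  train:
    dataset: MyTrainDatasetSplit
    input_mode: fsx    # assumption: modes may differ per channel
  test:
    dataset: MyTestDatasetSplit
    input_mode: file
```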