#!/bin/python
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
"""Alternative train launcher script for SageMaker Training Compiler

More info at: https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-enable.html

To use SMTC, you'll need to specify a `compiler_config` and set the "GPU_NUM_DEVICES" environment
variable on your Estimator to the number of GPUs per instance for the instance type you have
selected. For example:

```python
from sagemaker.huggingface import TrainingCompilerConfig

pre_estimator = HuggingFaceEstimator(
    ...,
    compiler_config=TrainingCompilerConfig(),
    env={
        ...,
        "GPU_NUM_DEVICES": "4",  # for ml.p3.8xlarge
    },
)
```

For single-GPU training, you can use the train.py entry_point as usual. However, for multi-GPU
training you'll instead need to set `entry_point="smtc_launcher.py"` and add an additional
hyperparameter `"training_script": "train.py"`. (A fuller, illustrative Estimator configuration is
sketched at the bottom of this file.)

This training script has been tested to *functionally* work with SMTC (on the Hugging Face v4.11
DLC), but whether you'll see a useful speed-up may be quite hyperparameter- and
use-case-dependent. Note that a substantial portion of the optimization opportunity with SMTC
comes from memory efficiency allowing larger batch sizes.

Remember also that on p3.16xlarge and larger instance types where it's supported, enabling
SageMaker Distributed Data Parallel can provide a useful speed boost. When *neither* SMTC nor
SMDistributed is enabled, the HF Trainer API will use PyTorch DataParallel by default (rather than
DistributedDataParallel), which can limit scaling to many GPUs - partly because memory consumption
is higher on the "lead" GPU, so CUDA out-of-memory errors are encountered at lower maximum batch
sizes.

Notes from pre-training experiments
-----------------------------------

2,500 document training set (set N_DOCS_KEPT = 2500 in notebook 1) on `ml.p3.8xlarge`,
pre-training with:

- num_train_epochs = 25
- early_stopping_patience = 10
- per_device_eval_batch_size = per_device_train_batch_size
- seed = 42
- warmup_steps = 200

| SMTC | per_device_train_batch_size | learning_rate | Execution Time       | min val loss |
|:----:|----------------------------:|--------------:|---------------------:|-------------:|
|  No  |                           4 |         5e-05 | 5h28m16s (25 epochs) |     0.149301 |
|  No  |                           8 |         2e-05 | 4h13m46s (25 epochs) |     0.154481 |
| Yes  |                          20 |         2e-05 | N/A (GPU OOM)        |    N/A (OOM) |
| Yes  |                          16 |         1e-04 | 5h03m03s (25 epochs) |     0.147910 |
| Yes  |                          16 |         5e-05 | 5h05m03s (25 epochs) |     0.141911 |
| Yes  |                          16 |         2e-05 | 5h02m52s (25 epochs) |     0.159771 |
| Yes  |                          16 |         1e-05 | 5h01m09s (25 epochs) |     0.191195 |
| Yes  |                          16 |         5e-06 | 5h01m48s (25 epochs) |     0.249820 |
| Yes  |                          12 |         1e-05 | 5h10m35s (25 epochs) |     0.165622 |
| Yes  |                           8 |         2e-05 | 2h50m02s (12 epochs) |   * 0.301963 |
| Yes  |                           8 |         1e-05 | 2h37m52s (11 epochs) |   * 0.627447 |

(*): Training unstable and stopped early after reaching `nan` loss. Best epoch reported.
"""
# Python Built-Ins:
import subprocess
import sys

if __name__ == "__main__":
    # Forward all CLI arguments (the SageMaker-injected hyperparameters) to the training script
    # via the SageMaker Training Compiler distributed launcher:
    arguments_command = " ".join(sys.argv[1:])
    subprocess.check_call(
        "python -m torch_xla.distributed.sm_dist " + arguments_command,
        shell=True,
    )
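
# ---------------------------------------------------------------------------
# Illustrative sketch (never invoked at runtime): how the Estimator described in the module
# docstring might be wired up to launch this script for multi-GPU SMTC training. The role ARN,
# framework versions, source_dir, and hyperparameters below are placeholder assumptions for
# illustration only - substitute the values appropriate to your own environment and train.py.
def _example_estimator_config():
    """Return a hedged example HuggingFace Estimator configured to use this launcher"""
    from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

    return HuggingFace(
        entry_point="smtc_launcher.py",  # This launcher wraps torch_xla.distributed.sm_dist
        source_dir=".",  # Assumption: train.py sits alongside this file
        role="<your-SageMaker-execution-role-ARN>",  # Placeholder
        instance_type="ml.p3.8xlarge",
        instance_count=1,
        transformers_version="4.11",  # Per the DLC version noted in the docstring
        pytorch_version="1.9",  # Assumption: the PyTorch version paired with that DLC
        py_version="py38",
        compiler_config=TrainingCompilerConfig(),
        env={"GPU_NUM_DEVICES": "4"},  # 4 GPUs per ml.p3.8xlarge instance
        hyperparameters={
            "training_script": "train.py",  # The actual training entry point
            # ...plus whatever arguments train.py itself expects
        },
    )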