# Usage

Run `make ami_gpu` or `make ami_cpu` to build an AMI for GPU instances (with EFA) or CPU instances; both support [pyxis](https://github.com/NVIDIA/pyxis) (see [here](https://github.com/NVIDIA/enroot/blob/9c6e979059699e93cfc1cce0967b78e54ad0e263/doc/cmd/import.md) to configure [AWS ECR](https://aws.amazon.com/ecr/) authentication out of the box), while `make docker` builds a container for use with GPUs and EFA.

Run `make deploy` to deploy the test cluster defined in `./test/cluster.yaml`, assuming your AWS credentials are in the config file under the default profile (`${HOME}/.aws`) and the cluster parameters (AMI, subnets, SSH keys) have been updated.

## Notes

* Review `packer-ami.pkr.hcl` for all available variables.
* We use a shared filesystem (`/fsx`) for the container cache; set the `ENROOT_CACHE_PATH` variable in `roles/nvidia_enroot_pyxis/templates/enroot.conf` according to your cluster.
* Review the variables (dependency versions) in `./roles/*/defaults/main.yml`, laid out per the [Ansible directory structure](https://docs.ansible.com/ansible/latest/tips_tricks/sample_setup.html).

# Preflight

Code is in the `./preflight` directory. It consists of sanity checks for:

* NVIDIA GPUs
* EFA and NVIDIA NCCL
* PyTorch

## Notes

* `torch.cuda.nccl.version()` in `preflight/preflight.sh` returns the NCCL version PyTorch was built with, while searching the output for `NCCL version` (with `NCCL_DEBUG=info` exported) reveals the version actually loaded at runtime, e.g. a preloaded one.

# Using a Deep Learning AMI

The [DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html) contains common DL dependencies and can be used with ParallelCluster. Use the following configuration:

```
Build:
  InstanceType: p2.xlarge
  ParentImage: ami-123
```

where `ami-123` is the ID of the DLAMI of your choice. Run [pcluster build-image](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster-v3.html) to add all the ParallelCluster dependencies.
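
As a concrete sketch of that step (the config file name, image ID, and region below are placeholders, not names from this repository), saving the snippet above as `image-config.yaml` and running the build could look like:

```
# Placeholder names: adjust the config path, image ID, and region to your setup.
pcluster build-image \
  --image-configuration image-config.yaml \
  --image-id dlami-pcluster-base \
  --region us-east-1

# The build runs asynchronously via EC2 Image Builder; poll until the
# image status reaches BUILD_COMPLETE before using the image.
pcluster describe-image \
  --image-id dlami-pcluster-base \
  --region us-east-1
```

Once the build completes, reference the resulting AMI in your cluster configuration, e.g. `./test/cluster.yaml`.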