---
title : "b. Create a distributed ML cluster"
date: 2020-09-04T15:58:58Z
weight : 10
tags : ["configuration", "ML", "ParallelCluster", "create", "cluster"]
---

In this step, you create a cluster configuration that supports your Distributed Machine Learning task.

{{% notice note %}}
If you are not familiar with AWS ParallelCluster, EFA, and FSx, we recommend that you complete the [Amazon FSx for Lustre lab](../04-amazon-fsx-for-lustre.html) and the [EFA lab](../07-efa.html) before proceeding.
In particular, you need to be able to [examine the FSx file system](../04-amazon-fsx-for-lustre/03-check-fs.html) and [examine the EFA enabled instance](../07-efa/02-check-efa.html).
The use of [NICE DCV](https://aws.amazon.com/hpc/dcv/) to interact with the cluster through a remote desktop is optional. Check out the [Remote Visualization using NICE DCV lab](../06-nice-dcv.html) for more information.
{{% /notice %}}

#### Create a Cluster Configuration File

This section assumes that you are familiar with AWS ParallelCluster and the process of bootstrapping a cluster.

Reuse the [**SSH key-pair**](/02-aws-getting-started/05-key-pair-create.html) created earlier.

The cluster configuration that you generate for training large-scale ML models includes the EFA and FSx constructs that you explored in the previous sections of this workshop. The main additions to the cluster configuration are:

- Use [p3dn.24xlarge instances](https://aws.amazon.com/ec2/instance-types/) for the compute nodes. The p3dn.24xlarge is a multi-GPU instance type and one of the [EFA supported instance types](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types).
- Set the initial cluster size to 0 compute nodes (`MinCount: 0`) and the maximum size to 2 instances (`MaxCount: 2`). With the Slurm scheduler, ParallelCluster launches and terminates compute nodes between these limits based on the jobs waiting in the queue.
- Set the compute capacity type to Spot with `CapacityType: SPOT`. [Amazon EC2 Spot Instances](https://aws.amazon.com/ec2/spot) cost less than On-Demand Instances, but they can be interrupted. Because the training workload checkpoints the model (saves it as training progresses), you can restart training after a job failure. Consider another capacity type when Spot availability is limited or when running large-scale training workloads that cannot be interrupted. Refer to [this documentation](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-CapacityType) to learn more about the impact of Spot Instance interruptions in ParallelCluster.
- Set the custom actions install script URL to the S3 path of the Conda configuration script, and grant ParallelCluster access to that S3 bucket. Add the following to the configuration:

```yaml
CustomActions:
  OnNodeConfigured:
    Script: s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
Iam:
  S3Access:
    - BucketName: mlbucket-${BUCKET_POSTFIX}
```
- Use Slurm as the job scheduler (`Scheduler: slurm`).

{{% notice tip %}}
For more details about the configuration options, see the [AWS ParallelCluster User Guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/parallelcluster-version-3.html), in particular the [EFA parameters](https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-ComputeResources-Efa) and [FSx parameters](https://docs.aws.amazon.com/parallelcluster/latest/ug/SharedStorage-v3.html#SharedStorage-v3-FsxLustreSettings) sections.
{{% /notice %}}

{{% notice note %}}
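If you are using a different terminal than in the previous section, make sure that the `BUCKET_POSTFIX` and `AWS_KEYPAIR` variables are set in this session so that the Amazon S3 bucket name `mlbucket-${BUCKET_POSTFIX}` and the key pair name are correct.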
{{% /notice %}}

```bash
# look up the Region and subnet of the current instance from the instance metadata service
export AWS_REGION=$(curl --silent http://169.254.169.254/latest/meta-data/placement/region)
export IFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/)
export SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${IFACE}/subnet-id)
```
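The heredoc in the next step expands shell variables while it writes the file, so an unset variable silently produces an incomplete configuration. The following is a minimal sketch of a check you could run first. The `check_required_vars` function is a hypothetical helper written for this page, not part of AWS ParallelCluster.

```bash
# Hypothetical helper: report every listed variable that is unset or empty.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Read the value of the variable whose name is stored in $name.
    # eval is acceptable here because the names are literals typed below.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# The four variables that ml-config.yaml relies on.
check_required_vars AWS_REGION SUBNET_ID AWS_KEYPAIR BUCKET_POSTFIX \
  && echo "all variables are set" \
  || echo "set the variables listed above before continuing"
```

If a variable is reported as missing, export it again in this terminal before you generate the configuration file.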

```bash
cat > ml-config.yaml << EOF
Region: ${AWS_REGION}
Image:
  Os: alinux2
SharedStorage:
  - MountDir: /shared
    Name: default-ebs
    StorageType: Ebs

  - Name: fsxshared
    StorageType: FsxLustre
    MountDir: /lustre
    FsxLustreSettings:
      StorageCapacity: 1200
      ImportPath: s3://mlbucket-${BUCKET_POSTFIX}
      DeploymentType: SCRATCH_2

HeadNode:
  InstanceType: c5n.2xlarge
  Networking:
    SubnetId: ${SUBNET_ID}
  Ssh:
    KeyName: ${AWS_KEYPAIR}
  Dcv:
    Enabled: true

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
      - Name: p3dn24xlarge
        InstanceType: p3dn.24xlarge
        MinCount: 0
        MaxCount: 2
        DisableSimultaneousMultithreading: true
        Efa:
          Enabled: true
      CapacityType: SPOT
      CustomActions:
        OnNodeConfigured:
          Script: s3://mlbucket-${BUCKET_POSTFIX}/post-install.sh
      Iam:
        S3Access:
          - BucketName: mlbucket-${BUCKET_POSTFIX}
      Networking:
        SubnetIds:
          - ${SUBNET_ID}
        PlacementGroup:
          Enabled: true
EOF
```

If you want to check the content of your configuration file, use the following command:

```bash
cat ml-config.yaml
```
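Beyond reading the file, you may want ParallelCluster to validate it before any resources are created. The sketch below assumes the `pcluster` version 3 CLI is on your `PATH` with working AWS credentials, and it uses the `--dryrun true` option of `create-cluster`. The `validate_cluster_config` wrapper is a hypothetical helper introduced here for illustration.

```bash
# Hypothetical helper: validate a configuration file without creating a cluster.
validate_cluster_config() {
  config_file="$1"
  cluster_name="$2"
  # Fail early if the heredoc step did not produce a file.
  if [ ! -s "${config_file}" ]; then
    echo "configuration file is missing or empty: ${config_file}" >&2
    return 1
  fi
  # With --dryrun true, the request is validated only; no resources are created.
  pcluster create-cluster --cluster-name "${cluster_name}" \
    --cluster-configuration "${config_file}" --dryrun true
}

# Example call (uncomment to run):
# validate_cluster_config ml-config.yaml ml-cluster
```

Any validation messages in the response point to the configuration keys that need attention.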

Now, you are ready to create your Distributed ML cluster.

#### Generate a Cluster for Machine Learning

Create the cluster using the following command. This process takes about 15 minutes, depending on the resources and settings.

```bash
pcluster create-cluster --cluster-name ml-cluster -c ml-config.yaml
```

Cluster creation continues even if your terminal session is terminated. To check the status of the creation, run `pcluster describe-cluster --cluster-name ml-cluster`.
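Rather than rerunning the status command by hand, you could poll it in a loop. This is a minimal sketch, assuming that `--query clusterStatus` prints only the status field of the `describe-cluster` response. The `wait_for_cluster` function and its default 60-second interval are hypothetical choices, not part of the workshop.

```bash
# Hypothetical helper: poll the cluster status until creation finishes.
wait_for_cluster() {
  cluster_name="$1"
  interval="${2:-60}"   # seconds to wait between polls
  while :; do
    # Give up with exit code 2 if the CLI call itself fails.
    status=$(pcluster describe-cluster --cluster-name "${cluster_name}" \
      --query clusterStatus) || return 2
    echo "$(date '+%H:%M:%S') ${cluster_name}: ${status}"
    case "${status}" in
      *CREATE_COMPLETE*) return 0 ;;   # the cluster is ready
      *FAILED*)          return 1 ;;   # creation failed
    esac
    sleep "${interval}"
  done
}

# Example call (uncomment to run):
# wait_for_cluster ml-cluster 60
```

The function returns 0 when the status contains `CREATE_COMPLETE` and 1 when it contains `FAILED`, so you can chain it with the connection command in the next section.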

#### Connect to Your Cluster

Once the cluster is created, connect to it using your key pair file.

```bash
pcluster ssh --cluster-name ml-cluster -i ${AWS_KEYPAIR}.pem
```

Next, preprocess the training data.