# Plugin Integration Into Existing Slurm Cluster

If you have an existing Slurm cluster and want to enable it to scale into AWS, you can integrate `SlurmPlugin.py` into your existing cluster.
The plugin implements the Slurm power saving API scripts, so the integration is as simple as copying the plugin and its scripts into your file system and then modifying your Slurm configuration to use them.

## Network Requirements

The instructions for integrating the AWS cluster with on-premises compute nodes contain the [network requirements](/onprem/#network-requirements).
The VPC's DNS must be configured so that it can resolve the on-prem hostnames, such as the Slurm controller and file systems.

## File System Requirements

The compute nodes in the AWS VPC will need access to the same file system paths as your existing compute nodes.
This can easily be accomplished by mounting your on-prem file systems across the VPN or Direct Connect connection, but that will likely have significant latency and performance impacts.
If your on-prem file systems are NetApp ONTAP, then you can create [Amazon FSx for NetApp ONTAP](https://aws.amazon.com/fsx/netapp-ontap/) file systems configured with FlexCache volumes to provide low latency, high performance caches of your on-prem file systems on AWS.
If you do not use NetApp, then you can use [Amazon File Cache](https://aws.amazon.com/filecache/) to create high performance caches of your on-prem file systems.

## Create AWS Slurm Cluster

The plugin requires a number of configuration files as well as AWS resources for the compute nodes, such as Amazon Machine Images (AMIs), IAM instance roles, and security groups.
The easiest way to create all of the required resources is to follow the [instructions to deploy a cluster with on-premises compute nodes](/onprem) into your AWS VPC and then use those resources to configure your existing cluster.
Note that `slurm_nodes_on_prem.conf` can be empty and doesn't have to include any on-prem compute nodes.
The configuration file should include the file systems that should be mounted on the AWS compute nodes.

After you've deployed the AWS cluster, mount the Slurm file system on your Slurm controller at `/opt/slurm/cluster-name`.
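The exact mount command depends on how the cluster's file system was deployed. As a minimal sketch, assuming the file system is exported over NFS (the DNS name below is a placeholder; use the value from your deployed cluster's stack outputs):

```
# Sketch only: fs-012345678.example.internal is a placeholder for your
# cluster file system's DNS name. Mount it at /opt/slurm/cluster-name.
sudo mkdir -p /opt/slurm/cluster-name
sudo mount -t nfs fs-012345678.example.internal:/opt/slurm/cluster-name /opt/slurm/cluster-name
```

Adding the equivalent entry to `/etc/fstab` makes the mount persist across controller reboots.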
## AWS Credentials

The power saving scripts run as the `slurm` user on your Slurm controller and require AWS IAM permissions to orchestrate AWS instances for your Slurm cluster.
Specifically, the Slurm controller needs access to the AMIs and permission to start, stop, and terminate EC2 instances.
The permissions are defined in the SlurmCtlPolicy resource of the cluster's CloudFormation stack.
You can create a `slurm` IAM user and attach the SlurmCtlPolicy to it.
Then you can create IAM credentials for the `slurm` user and save them on the Slurm controller using `aws configure`.
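For example, a minimal sketch using the AWS CLI, run as the `slurm` user on the controller (the access key, secret, and region below are placeholders):

```
# Run as the slurm user on the Slurm controller.
# The key ID, secret, and region are placeholders; use your own values.
aws configure set aws_access_key_id AKIAIOSFODNN7EXAMPLE
aws configure set aws_secret_access_key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
aws configure set region us-east-1
```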
## Amazon Machine Images for Compute Nodes

You can modify the AMIs created by the AWS Slurm cluster for use by your on-prem cluster.

## SSM Parameter Store Variables

The plugin reads several SSM parameters to get information that it needs to orchestrate instances.

| SSM Parameter | Description |
|---------------|-------------|
| `/cluster-name/SlurmNodeAmi/os-distribution/os-major-version/architecture/region` | AMIs for AWS EC2 compute nodes |
| `/cluster-name/SlurmNodeEc2KeyPairs/region` | EC2 key pairs for each region |
| `/cluster-name/SlurmNodeSecurityGroups/region` | Security groups for AWS EC2 compute nodes |

The fields in the parameter names have the following valid values:

| Field | Valid Values |
|-------|--------------|
| os-distribution | AlmaLinux \| Amazon \| CentOS \| RedHat \| Rocky |
| architecture | arm64 \| x86_64 |

## Slurm Configuration

The existing Slurm cluster's configuration file will need to be updated to configure the AWS compute nodes and to use `SlurmPlugin.py`.
Update your `slurm.conf` based on `/opt/slurm/slurmaws/etc/slurm.conf`.
At a minimum you will need to add the following lines.

```
# CommunicationParameters:
# NoAddrCache: Do not assume that nodes will retain their IP addresses.
#              Do not cache the node->ip mapping.
CommunicationParameters=NoAddrCache

# ReturnToService
# 0: Make node available if it registers with a valid configuration regardless of why it is down.
# 1: Only return to service if DOWN due to being non-responsive.
# 2: A DOWN node will become available for use upon registration with a valid configuration.
ReturnToService=2

# TreeWidth is set to the maximum for cloud clusters so messages go directly between controller and nodes.
# https://slurm.schedmd.com/elastic_computing.html
TreeWidth=65533

#
# SlurmctldParameters
# cloud_dns: Do not set cloud_dns unless joining the domain and adding nodes to DNS.
#     Joining the domain overloads the DCs when lots of nodes are started, so we are
#     not joining the domain or using cloud_dns.
# idle_on_node_suspend: Mark nodes as idle, regardless of current state, when suspending
#     nodes with SuspendProgram so that nodes will be eligible to be resumed at a later time.
# node_reg_mem_percent: Percentage of memory a node is allowed to register with without
#     being marked as invalid with low memory.
#     Use this so nodes can be configured with 100% of their memory.
#     Without this option the node will fail because the system uses some of the memory.
SlurmctldParameters=idle_on_node_suspend,node_reg_mem_percent=90

#
# Allow users to see state of nodes that are powered down
PrivateData=cloud

#
# POWER SAVE SUPPORT FOR IDLE NODES
#
SuspendProgram=/opt/slurm/slurmaws/bin/slurm_ec2_stop.py
ResumeProgram=/opt/slurm/slurmaws/bin/slurm_ec2_resume.py
ResumeFailProgram=/opt/slurm/slurmaws/bin/slurm_ec2_resume_fail.py

# Maximum time between when a node is suspended and when suspend is complete.
# At that time it should be ready to be resumed.
SuspendTimeout=60
ResumeTimeout=600

# Number of nodes per minute that can be resumed or suspended
ResumeRate=300
SuspendRate=60

# Time that a node has to be idle or down before being suspended
# Should be >= (SuspendTimeout + ResumeTimeout)
SuspendTime=660

include /opt/slurm/slurmaws/etc/slurm_nodes.conf
```
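After updating `slurm.conf`, the controller must pick up the new configuration. A minimal sketch, assuming `slurmctld` is managed by systemd on your controller:

```
# Restart the controller so all changed parameters take effect,
# then verify the power saving programs are active.
sudo systemctl restart slurmctld
sudo scontrol show config | grep -E 'SuspendProgram|ResumeProgram'
```

A restart is the safe choice here; `scontrol reconfigure` applies many parameters on the fly, but not necessarily all of the ones changed above.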