Required and Recommended Components#
The Overview describes basic, high availability, advanced networking, and multi-tenant ICE ClusterWare™ cluster architectures. The following sections provide minimum and recommended components for your cluster.
Head Nodes#
Use virtual machines hosted by bare metal hypervisors for ClusterWare head nodes. Virtual machines are easy to resize and easy to migrate between hypervisors. The bare metal hypervisor host must provide the aggregate resources required by all hosted virtual machines, plus additional CPU cores and RAM for the hypervisor itself. ClusterWare head nodes should use x86_64 processors running Red Hat Enterprise Linux (RHEL), Rocky Linux, or a similar distribution. See Supported Distributions and Features for specifics.
ClusterWare head nodes should ideally be lightweight for simplicity and contain only software that is needed for the local cluster configuration. Non-root users typically do not have direct access to head nodes and do not execute applications on head nodes.
Head node components for a production cluster include:
x86_64 processor(s) with a minimum of 4 cores. Including more than 4 cores in the virtual machine will speed up common activities, such as unpacking and packing images. Large clusters may need additional cores.
Note
Contact Penguin Computing if you are interested in using ClusterWare with AArch64 or RISC-V architectures.
4GB RAM (minimum), 8GB RAM (recommended). Large clusters may need additional RAM.
100GB local NVMe storage (minimum).
The largest consumer of storage holds packed images, uploaded ISOs, and so on. Its location is set in the file /opt/scyld/clusterware/conf/base.ini and defaults to /opt/scyld/clusterware/storage/. This location should be unique per head node if your cluster contains multiple head nodes. It should not be set to shared storage between head nodes.
The directory /opt/scyld/clusterware/git/cache/ consumes storage roughly the size of the Git repos hosted by the system.
Other than the storage/ and git/cache/ subdirectories discussed above, the /opt/scyld/ directory consumes roughly 1GB.
Each administrator's ~/.scyldcw/workspace/ directory contains unpacked images that have been downloaded by an administrator for modification or viewing.
Large clusters with more images, more nodes with log aggregation, or long retention periods for telemetry data may want more local storage on a head node. (See the storage check sketch after this list.)
One Ethernet controller (required) that connects to the private cluster network to interconnect the head node(s) with all compute nodes.
A second Ethernet controller (recommended) that connects the head node to the internet.
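To see how much space the storage locations listed above are actually consuming on a head node, you can total them directly. This is a minimal sketch using the default paths described earlier; the per-administrator check assumes the default ~/.scyldcw/workspace/ location:

    # Space used by the default ClusterWare storage locations
    du -sh /opt/scyld/clusterware/storage/
    du -sh /opt/scyld/clusterware/git/cache/
    du -sh /opt/scyld/
    # Per-administrator workspace of unpacked images (run as that administrator)
    du -sh ~/.scyldcw/workspace/
    # Remaining headroom on the filesystem holding the storage directory
    df -h /opt/scyld/clusterware/storage/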
A High Availability ("HA") cluster requires a minimum of three production head nodes, each a virtual machine hosted on a different bare metal hypervisor. You can have up to seven head nodes.
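As one hedged example, a head node virtual machine sized to the recommendations above could be created on a hypervisor with virt-install. The VM name, bridge names, OS variant, and ISO path below are assumptions for illustration only; adjust them for your environment:

    # Sketch of a head node VM: 4 cores, 8GB RAM, 100GB disk, two NICs
    # (one on the private cluster bridge, one on an internet-facing bridge).
    virt-install \
        --name cw-head1 \
        --vcpus 4 \
        --memory 8192 \
        --disk size=100 \
        --network bridge=br-bootnet \
        --network bridge=br-external \
        --os-variant rhel9.4 \
        --cdrom /var/lib/libvirt/images/rhel-9.4-x86_64-dvd.iso

The bridges referenced here are created on the hypervisor as described in the Networking section below.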
Compute Nodes#
You can have from tens to thousands of compute nodes in a ClusterWare cluster. Compute nodes are generally bare metal servers for optimal performance. See Supported Distributions and Features for a list of supported distributions.
Production compute nodes are bare metal machines that meet the cluster application needs of your end users. The requirements of the ClusterWare software itself are low, but typical production compute nodes include:
x86_64 processors with multiple cores
192GB RAM or more
You can create a virtual compute node for testing or demonstrations. For example, cluster administrators may use a small number of virtual compute nodes to test system upgrades or image updates prior to production deployment. Virtual compute nodes require:
x86_64 processor with 1 core (minimum), 2 cores (recommended)
6GB RAM (minimum), 8GB RAM (recommended)
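As a hedged sketch, a small test compute node can be created as a virtual machine that boots over the private cluster network from the head node. The VM name and bridge name are assumptions, and the command presumes the hypervisor bridge described in the Networking section already exists:

    # Sketch of a 2-core, 8GB virtual compute node that network boots
    # on the private cluster bridge; no local disk is attached here.
    # Add a --disk option if the image you plan to boot expects local storage.
    virt-install \
        --name cw-testnode0 \
        --vcpus 2 \
        --memory 8192 \
        --disk none \
        --network bridge=br-bootnet \
        --pxe \
        --os-variant rhel9.4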
Login Nodes#
Login node resources should be similar to those of the head nodes, but you may want additional compute power and storage if users compile and test applications on the login node. Include enough storage for the local operating system (OS). If your cluster has shared storage accessible to the compute nodes, that shared storage should also be accessible from the login node.
Networking#
Compute nodes commonly have multiple Ethernet controllers or other high-performance network controllers (InfiniBand, Omni-Path), but these networks do not need to be accessible by the head node(s).
Use the nmcli connection add tool to create network bridges and to add physical interfaces to those newly created bridges. Once the appropriate bridges exist, use the virt-install command to attach virtual interfaces to the bridges so that the virtual machines you create are on the same networks as the physical interfaces on the hypervisor.
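A minimal sketch of that bridge setup, assuming the hypervisor's physical interface eno1 carries the private cluster network; the connection and bridge names are illustrative only:

    # Create a bridge for the private cluster network and attach the
    # physical interface to it
    nmcli connection add type bridge con-name br-bootnet ifname br-bootnet
    nmcli connection add type bridge-slave con-name br-bootnet-eno1 \
        ifname eno1 master br-bootnet
    nmcli connection up br-bootnet

Virtual machines created with virt-install then attach to this bridge with an option such as --network bridge=br-bootnet, as in the earlier head node sketch.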
Important
By design, ClusterWare compute nodes handle DHCP responses on the private cluster network (bootnet) by employing the base distribution's facilities, including NetworkManager. If your cluster installs a network file system or other software that disables this base distribution functionality, then you must configure dhclient or custom static IP addresses and potentially additional workarounds.
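As a hedged illustration of those fallbacks, either of the following could be run on a compute node where NetworkManager no longer manages the private cluster interface. The interface name and addresses are assumptions, and neither command persists across reboots without additional configuration:

    # Option 1: request a DHCP lease manually on the cluster interface
    dhclient -v eno1

    # Option 2: assign a static address directly (non-persistent)
    ip link set eno1 up
    ip addr add 10.54.0.101/24 dev eno1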
Slurm and Kubernetes#
If your cluster includes Slurm or Kubernetes nodes, consult the respective documentation for their system requirements.
Multi-Tenant Clusters#
In addition to the head node, compute node, and networking requirements detailed above, multi-tenant clusters must include:
Enterprise Sonic switches version 4.4 or later that support EVPN technology, such as Enterprise Standard Sonic or Enterprise Premium Sonic.
InfiniBand network that supports PKey partitioning.
Multi-tenant clusters can include up to 64 Enterprise Sonic switches, up to 1200 compute nodes, approximately 10,000 GPUs, plus supporting infrastructure. Contact Penguin Computing to design a supported multi-tenant environment that meets your needs.