Release Notes

ClusterWareAI™ Release v13.2.0 is the latest update to the ClusterWareAI platform.

For the most up-to-date product documentation, visit https://docs.ice.penguinsolutions.com/. The latest version of the documentation reflects the current state of the ClusterWareAI Yum repository of RPMs that you are about to install. For additional information about ClusterWareAI, visit the Penguin Computing Support Portal at https://www.penguinsolutions.com/en-us/support.

13.2.0 Release Notes

  • Version 13.2.0 introduces the AI Factory Operations Agent in the ClusterWareAI GUI. The agent uses a locally available LLM to query the ClusterWareAI backend and answer basic questions about your cluster through a natural-language chat window. It runs under a new system-level account that is restricted to read-only permissions.

  • A new ClusterWareAI Health Monitoring System (CHMS) is available. It includes a set of default health checks that you can assign to your compute nodes to monitor and report on common compute node issues, such as memory issues, as well as hardware- and software-specific problems, such as NVIDIA GPU health or Slurm problems.

  • The Auto Remediation Service (ARS) is enhanced and now works with CHMS. There are also new ClusterWareAI GUI pages to monitor compute node health, uptime, and remediation status. You can configure ARS to work with industry-standard workload schedulers, including Slurm and Kubernetes.

  • A compute node terminal is now available from the ClusterWareAI GUI. Use the terminal to execute a single command on many nodes at once (tens or hundreds), collect the outputs, and view them in a compact display. Commands run as root, so you can use the terminal to query information from nodes, change settings, install packages, and so on.

See Changelog for a full history of ClusterWareAI releases, and Known Issues And Workarounds for a summary of notable known current issues.

Deprecated Features

The following features were deprecated with the 13.1 release. For commands and command arguments, the legacy values continue to work for now but will be removed in a future release. Update scripts and other automation accordingly.

  • The user token duration argument name changed from timeout or lifetime to lifespan.

  • ClusterWareAI commands now use the cw-* prefix rather than the scyld-* prefix.

  • The rwtab value is removed from the _boot_rw_layer reserved attribute.
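
When updating automation for the command prefix change, it can help to first locate every legacy scyld-* command name in your scripts. The following Python sketch is one possible way to do that; it is not a tool shipped with ClusterWareAI. It assumes that each legacy command maps to the same name with the cw- prefix (scyld-X becomes cw-X), which you should confirm for each command against the product documentation. The function names and the command name scyld-example-tool are hypothetical placeholders used only for illustration.

```python
import re

# Assumption: every legacy command maps to the same suffix under the new
# prefix (scyld-foo -> cw-foo). Confirm each command in the documentation.
# The lookbehind skips matches embedded in longer names such as "my-scyld-x".
LEGACY_COMMAND = re.compile(r"(?<![\w-])scyld-([a-z][a-z0-9-]*)")


def find_legacy_commands(text):
    """Return (line_number, legacy_name, suggested_name) for each match."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_COMMAND.finditer(line):
            suggested = "cw-" + match.group(1)
            findings.append((line_number, match.group(0), suggested))
    return findings


def rewrite_legacy_commands(text):
    """Return the text with each scyld-* command name replaced by cw-*."""
    return LEGACY_COMMAND.sub(lambda m: "cw-" + m.group(1), text)


if __name__ == "__main__":
    # "scyld-example-tool" is a made-up command name, not a real CLI.
    script = "#!/bin/bash\nscyld-example-tool status\necho done\n"
    for line_number, legacy, suggested in find_legacy_commands(script):
        print(f"line {line_number}: {legacy} -> {suggested}")
```

Reporting matches before rewriting lets you review each suggested name by hand. The sketch does not attempt to handle the token duration argument rename, because the exact argument syntax is not given here.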