The Restricted Saion Partitions

The “gpu-v100”, “gpu-p100” and “gpu-a100” partitions are restricted partitions meant for general GPU or high-core computation. The gpu-a100 partition is specifically for computations that need a lot of GPU memory. You can apply for access here.

The Saion system works a little differently from the main cluster. If you haven’t already, please read the Saion introduction here before you continue.

New partitions

We have recently created new partitions with a more consistent naming scheme:

New	Old
gpu-p100	“gpu” partition p100 nodes
gpu-v100	“gpu” partition v100 nodes
gpu-a100	“largegpu” nodes.

The old partitions are still available, but new Saion users will get access to the new partitions.

Relion, CryoSparc etc.

If you run and maintain software such as Relion or CryoSparc for others to use, you should add the new partitions to your setup. New users won’t be able to use the old partition names.

“gpu-v100” and “gpu-p100”

NOTE: this pair of partitions was previously named “gpu”. New users instead get access to “gpu-v100” and “gpu-p100”, and existing “gpu” users will eventually be moved to the new pair of partitions.

This pair of partitions has a total of 16 GPU nodes. Each GPU node has 36 CPU cores and 512GB main memory. 8 nodes each have four NVIDIA P100 GPUs with 16GB memory per GPU. Another 8 nodes each have four NVIDIA V100 GPUs, also with 16GB memory per GPU. You get access to both partitions. See below for the differences between them.

Your allocation on each partition is 36 CPU cores and 4 GPUs. In total you can use 72 cores and 8 GPUs, with 4 GPUs on gpu-v100 and 4 GPUs on gpu-p100. The maximum job run time is 7 days. You can contact us at ask-scda@oist.jp to change the allocation on either partition to 8 GPUs, 72 CPU cores and a 2-day limit if you prefer.

For the legacy GPU partition: you have a total of 4 GPUs for 7 days across both the p100 and v100 nodes. You can ask us to switch you to the new partitions if you like.

Start a Job

To start a job you need to specify the ‘gpu-v100’ or ‘gpu-p100’ partition with the -p option. To get access to a GPU, you also need to ask for a GPU with the --gres (short for “general resource”) option:

$ srun -t 0-4 -c 8 -p gpu-v100 --mem=32G --gres=gpu:1 --pty bash -l

This gives you 32G memory, 8 cores and one GPU for 4 hours. It starts a new interactive command line (“bash -l”) on the node.
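The same resources can also be requested from a batch script instead of an interactive session. The following is a minimal sketch, assuming you save it under a hypothetical name such as job.sh and submit it with sbatch. The report_gpus helper is an illustration, not part of the Saion setup.

```shell
#!/bin/bash
# Partition: use gpu-p100 instead for the P100 nodes.
#SBATCH -p gpu-v100
# Time limit: 0 days, 4 hours.
#SBATCH -t 0-4
# 8 CPU cores, 32G main memory and one GPU, as in the srun example.
#SBATCH -c 8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1

# Hypothetical helper: list the GPUs the job can see, or print a message
# if nvidia-smi is not on the PATH (for example off-cluster).
report_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found; is this a GPU node?"
    fi
}

report_gpus
```

Submit it with "sbatch job.sh". The job output should then list the single GPU that was assigned to the job.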

“gpu-v100” and “gpu-p100” differences

NOTE: the P100 partition is still being upgraded.

To get an upgraded node, for now add

--nodelist=saion-gpu[11-14]

to your Slurm options.

Both partitions have the same operating system and the same host hardware. However, the P100 GPUs in “gpu-p100” are one generation older than the V100 GPUs, and the last CUDA version that supports the P100 GPUs is 12.8.

On the gpu-v100 partition, the newest version currently installed is CUDA 13.2. The V100 GPUs are also a bit faster as they are newer, though the difference is typically not significant.

The legacy “gpu” partition

If you are a legacy “gpu” user, you can choose which GPU type to use by adding the “v100” or “p100” type to the --gres option:

$ srun ...  --gres=gpu:v100:1 ...

This will ensure you get one of the v100 nodes.

Building your code

You should build your application on a GPU compute node. The login nodes do not have the CUDA and other libraries that you typically need.

We have CUDA 11.1, 11.3, 12.2, 12.8 and 13.2 installed as modules on the V100 nodes. The P100 nodes have CUDA up to 12.8. Always load the CUDA version you want as a module.

To start an interactive job, try this:

$ srun -t 0-4 -c 8 -p gpu-v100 --mem=32G --gres=gpu:1 --pty bash -l

If you only intend to compile the code and not test it, you can of course refrain from asking for a GPU at all. That way you can still use a GPU node for building your code even if all GPUs are in use by somebody else.
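Once you are on a compute node and have loaded a CUDA module, a quick check like the one below can confirm that the compiler is available before you start a build. This is only a sketch: require_tool is a hypothetical helper, and the exact module names on Saion are not given here, so list them with "module avail" first.

```shell
# Hypothetical helper: report whether a build tool is on the PATH.
# Returns non-zero if the tool is missing.
require_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found at $(command -v "$tool")"
    else
        echo "$tool not found: load a CUDA module first (see 'module avail')"
        return 1
    fi
}

# Check for the CUDA compiler; '|| true' keeps the shell session going
# if it is missing.
require_tool nvcc || true
```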

NOTE: If you want your code to work on both gpu-p100 and gpu-v100, build on a gpu-p100 node using CUDA no newer than 12.8. That program should then work without changes on both gpu-p100 and gpu-v100, and on gpu-a100 if you need that.
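If you compile with nvcc directly, one way to target several GPU generations in one binary is to pass a -gencode flag per architecture. The P100, V100 and A100 have compute capability 6.0, 7.0 and 8.0 respectively. The sketch below only prints the compile command; gencode_flags is a hypothetical helper and app.cu is a placeholder file name, and your own build system may set these flags differently.

```shell
# Build nvcc -gencode flags for a list of compute capabilities.
# 60 = P100 (Pascal), 70 = V100 (Volta), 80 = A100 (Ampere).
gencode_flags() {
    local flags="" cc
    for cc in "$@"; do
        flags="${flags} -gencode arch=compute_${cc},code=sm_${cc}"
    done
    # Drop the leading space before printing.
    echo "${flags# }"
}

# Print the compile command rather than running it; app.cu is a placeholder.
echo "nvcc $(gencode_flags 60 70 80) -o app app.cu"
```

Run the printed command on a compute node after loading a CUDA module no newer than 12.8.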

The “gpu-a100” partition (formerly “largegpu”)

This is a restricted partition, with access only for those who can’t use the regular gpu partitions.

The partition has 4 nodes. Each node has 8 NVIDIA A100-SXM4 GPUs, 128 cores and 2TB memory; that is 16 cores and 256GB memory per GPU. Each A100 GPU has 80GB memory.

Each A100 has five times as much memory as the P100 and V100 GPUs in the “gpu-v100” and “gpu-p100” partitions. Many applications should run 2-3× faster. Memory bandwidth-limited tasks should see a speedup toward the lower end of that range. Deep learning code built to take advantage of the specifics of the A100 can be more than 3× faster.

Access

To get access to the gpu-a100 partition you need to show your requirements are not already fulfilled by the P100 and V100 GPUs.

If you are interested, please submit a request on this page (select “gpu partitions” then “gpu-a100”) with a description of your workflow and why the gpu partition is not sufficient. We will ask you to come to the Open Hours where we will discuss your request.

We have no default allocation for gpu-a100. Your resources will be based on the specific task you are trying to run.

Logging in and building your code

gpu-a100 is similar to the other gpu partitions; you need to build your code on one of the compute nodes as the login nodes don’t have the required libraries.

You access the node by specifying the “gpu-a100” partition and using the “--gres” option to specify the number of GPUs:

$ srun -p gpu-a100 -c 16 --mem=128G --gres=gpu:1 --pty bash -l
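If you need more than one GPU, you may want to scale cores and memory with the GPU count. The sketch below uses the 16 cores and 256GB per GPU figure given for this partition as an upper bound and only prints the resulting command. The a100_session_cmd helper is hypothetical, and the resources you can actually request depend on the allocation agreed for your task.

```shell
# Print an srun command that scales cores and memory with the GPU count,
# using the per-GPU share of a gpu-a100 node (16 cores, 256GB per GPU).
a100_session_cmd() {
    local gpus="${1:-1}"
    local cores=$(( gpus * 16 ))
    local mem_gb=$(( gpus * 256 ))
    echo "srun -p gpu-a100 -c ${cores} --mem=${mem_gb}G --gres=gpu:${gpus} --pty bash -l"
}

# Example: the command for a two-GPU interactive session.
a100_session_cmd 2
```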

On the nodes you have access to CUDA 11.0, 11.3, 12.2, 12.8, and 13.2. The nodes share their modules with the regular gpu partition nodes. Note that the A100 hardware does not support CUDA 10 or older, so any code using those versions will need to be updated or rebuilt.

The OS version is Rocky 8.8 (Red Hat 8.8 without the branding). Most modules and most code that you built on the “gpu-p100” and “gpu-v100” partitions should work on gpu-a100 (as long as they use CUDA 11 or later). However, the opposite will generally not work: if you build an application on a gpu-a100 node it may not run on the older partitions.
