
The Restricted Saion Partitions

The “gpu” and “largegpu” partitions are restricted partitions meant for general GPU or high-core computation. The “largegpu” partition is specifically for computations that need a large amount of GPU memory. You can apply for access here.

The Saion system works a little differently from the main cluster. If you haven’t already, please read the Saion introduction here for an overview of the system.

The “gpu” partition

This partition has a total of 16 GPU nodes, each with 36 CPU cores and 512GB of main memory. Eight of the nodes each have four NVIDIA Tesla P100 GPUs with 16GB of GPU memory per card; the other eight nodes each have four NVIDIA V100 GPUs, also with 16GB per card.

The “gpu” partition gives you the next free node, irrespective of the GPU type. You can specify the GPU type if you want (see below), but in practice many jobs won’t see a significant difference.

Your allocation on this partition is 36 CPU cores and 4 GPUs in total, with a maximum job run time of 7 days. If you prefer, you can contact us at ask-scda@oist.jp to change this to 8 GPUs and 72 CPU cores with a 2-day limit.

Start a Job

To start a job you need to specify the “gpu” partition with the -p option. To get access to a GPU, you also need to request one with the --gres option:

$ srun -t 0-4 -c 8 -p gpu --mem=32G --gres=gpu:1 --pty bash -l

This gives you 32GB of memory, 8 CPU cores and one GPU for up to 4 hours.
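If you prefer batch jobs, the same request can go into a submission script instead. A minimal sketch (the CUDA module version and the program name are placeholders; substitute your own):

#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 0-4
#SBATCH -c 8
#SBATCH --mem=32G
#SBATCH --gres=gpu:1

module load cuda/12.8    # assumed module name; pick one of the versions listed below
./my_gpu_program         # placeholder for your own executable

Submit it with “sbatch yourscript.slurm”.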

At this time, the V100 nodes have been updated with a recent OS version and recent NVIDIA drivers and CUDA. You most likely want to use these. To ask for a V100 GPU specifically, add it to the gres option like this:

$ srun ...  --gres=gpu:v100:1 ...

This will ensure you get one of the updated nodes.
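Once the job starts, nvidia-smi on the node will confirm which GPU model and how much GPU memory you were assigned:

$ nvidia-smi --query-gpu=name,memory.total --format=csv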

Building your code

You should build your application on a GPU compute node. The login nodes do not have the CUDA and other libraries that you typically need.

We have CUDA 11.1, 11.3, 12.2 and 12.8 installed as modules on the V100 nodes. The P100 nodes only have CUDA up to 11.3. Always load the CUDA version you want as a module.
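For example, to list the installed versions and load one (a sketch, assuming the modules follow the usual cuda/<version> naming):

$ module avail cuda       # list the installed CUDA modules
$ module load cuda/12.8   # load the version you want
$ nvcc --version          # confirm the compiler matches the loaded module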

To start an interactive job, try this:

$ srun -t 0-4 -c 8 -p gpu --mem=32G --gres=gpu:v100:1 --pty bash -l

If you only intend to compile the code and not test it, you can of course refrain from asking for a GPU at all. That way you can still use a GPU node for building your code even if all GPUs are in use by somebody else.
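A compile-only session could then look like this, simply dropping the --gres option:

$ srun -t 0-2 -c 8 -p gpu --mem=16G --pty bash -l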

The “largegpu” partition

This is a restricted partition; access is granted only to users whose needs cannot be met by the regular gpu partition.

The partition has 4 nodes. Each node has 8 NVIDIA A100-SXM4 GPUs, 128 cores and 2TB memory; that is 16 cores and 256GB memory per GPU. Each A100 GPU has 80GB memory.

The A100 has five times the memory of the P100 and V100 GPUs in the “gpu” partition. Performance should be 2-3× faster for many applications; memory bandwidth-limited tasks should see a speedup toward the lower end of that range, while deep learning code built to take advantage of the specifics of the A100 can be more than 3× faster.

Access

To get access to the largegpu partition you need to show that your requirements cannot be met by the P100 and V100 GPUs in the gpu partition.

If you are interested, please submit a request on this page (select “gpu partitions” then “largegpu”) with a description of your workflow and why the gpu partition is not sufficient. We will ask you to come to the Open Hours where we will discuss your request.

We have no default allocation for largegpu. Your resources will be based on the specific task you are trying to run.

Logging in and building your code

largegpu is similar to the gpu partition; you need to build your code on one of the compute nodes as the login nodes don’t have the required libraries.

You access the nodes by specifying the “largegpu” partition, and use the “--gres” option to specify the number of GPUs:

$ srun -p largegpu -c 16 --mem=128G --gres=gpu:1 --pty bash -l
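Since each A100 comes with 16 cores and 256GB of main memory, it makes sense to scale the CPU and memory request with the number of GPUs. A two-GPU session might look like this (a sketch; your actual limits depend on the allocation agreed for your task):

$ srun -p largegpu -c 32 --mem=512G --gres=gpu:2 --pty bash -l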

On the nodes you have access to CUDA 11.0, 11.3, 12.2 and 12.8, and they share the modules with the regular gpu partition nodes. Note that while older CUDA versions are available, the A100 hardware does not support CUDA 10 or older, so any code using older versions will need to be updated or rebuilt.

The OS version is Rocky Linux 8.8 (Red Hat Enterprise Linux 8.8 without the branding). Most modules and most code that you built on the gpu partition should work on largegpu (as long as they use CUDA 11 or later). The opposite will generally not hold, however: an application built on a largegpu node may not run on the regular gpu partition.
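If you want a single binary that runs on both partitions, one option is to compile for each compute capability explicitly (the P100 is sm_60, the V100 is sm_70 and the A100 is sm_80). A sketch with nvcc:

$ nvcc -gencode arch=compute_60,code=sm_60 \
       -gencode arch=compute_70,code=sm_70 \
       -gencode arch=compute_80,code=sm_80 \
       my_kernel.cu -o my_program    # my_kernel.cu is a placeholder for your source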
