PlaFRIM Users Documentation


1 Hardware Documentation

PlaFRIM aims to allow users to experiment with new hardware technologies and to develop new codes.

Access to the cluster state (sign-in required): Pistache

You will find below a list of all the available nodes on the platform by category.

To allocate a specific category of node with SLURM, you need to specify the node features. To display the list, call the command

$ sinfo -o "%60f %N"
AVAIL_FEATURES                                               NODELIST
miriel,intel,haswell,infinipath                              miriel[044-045,048,050-053,056-058,060-064,066-073,075-076,078-079,081,083-088]
miriel,intel,haswell,omnipath,infinipath                     miriel[001-043]
amd,diablo,bigmem                                            diablo05
brise,intel,broadwell,bigmem                                 brise
bora,intel,cascadelake,omnipath                              bora[001-044]
sirocco,intel,broadwell,omnipath,nvidia,tesla,p100           sirocco[07-13]
sirocco,intel,skylake,omnipath,nvidia,tesla,v100             sirocco[14-16]
sirocco,intel,skylake,omnipath,nvidia,tesla,v100,bigmem      sirocco17
sirocco,intel,skylake,nvidia,quadro,rtx8000                  sirocco[18-20]
amd,zonda                                                    zonda[01-21]
arm,cavium,thunderx2                                         arm01
amd,diablo                                                   diablo[01-04]
kona,intel,knightslanding,knl                                kona[01-04]
sirocco,intel,haswell,mellanox,nvidia,tesla,k40m             sirocco[01-05]
sirocco,amd,nvidia,ampere,a100                               sirocco21
souris,sgi,ivybridge,bigmem                                  souris
visu                                                         visu01
mistral                                                      mistral[02-03,06]

For example, to reserve a bora node, you need to call

$ salloc -C bora

To reserve a sirocco node with V100 GPUs, you need to call

$ salloc -C "sirocco&v100"

1.1 Overview

| Nodes           | CPU                          | Memory        | GPU                     | Storage            |
|-----------------+------------------------------+---------------+-------------------------+--------------------|
| bora001-044     | 2x 18-core Intel CascadeLake | 192 GB        |                         | /tmp of 1 TB       |
| miriel001-088   | 2x 12-core Intel Haswell     | 128 GB        |                         | /tmp of 300 GB     |
| diablo001-004   | 2x 32-core AMD Zen2          | 256 GB        |                         | /tmp of 1 TB       |
| diablo005       | 2x 64-core AMD Zen2          | 1 TB          |                         | /tmp of 1 TB       |
| zonda01-21      | 2x 32-core AMD Zen2          | 256 GB        |                         |                    |
| arm01           | 2x 28-core ARM TX2           | 256 GB        |                         | /tmp of 128 GB     |
| sirocco01-02,05 | 2x 12-core Intel Haswell     | 128 GB        | 4 NVIDIA K40M           | /tmp of 1 TB       |
| sirocco03-04    | 2x 12-core Intel Haswell     | 128 GB        | 3 NVIDIA K40M           | /tmp of 1 TB       |
| sirocco06       | 2x 10-core Intel IvyBridge   | 128 GB        | 2 NVIDIA K40M           | /tmp of 1 TB       |
| sirocco07-13    | 2x 16-core Intel Broadwell   | 256 GB        | 2 NVIDIA P100           | /tmp of 300 GB     |
| sirocco14-16    | 2x 16-core Intel Skylake     | 384 GB        | 2 NVIDIA V100           | /scratch of 750 GB |
| sirocco17       | 2x 20-core Intel Skylake     | 1 TB          | 2 NVIDIA V100           | /tmp of 1 TB       |
| sirocco18-20    | 2x 20-core Intel CascadeLake | 192 GB        | 2 NVIDIA Quadro RTX8000 |                    |
| sirocco21       | 2x 24-core AMD Zen2          | 512 GB        | 2 NVIDIA A100           | /scratch of 3.5 TB |
| kona01-04       | 64-core Intel Xeon Phi       | 96 GB + 16 GB |                         | /scratch of 800 GB |
| brise           | 4x 24-core Intel Broadwell   | 1 TB          |                         | /tmp of 280 GB     |
| souris          | 12x 8-core Intel IvyBridge   | 3 TB          |                         |                    |

1.1.1 Network Overview

All nodes are connected through a 10 Gbit/s Ethernet network that may also be used for the BeeGFS storage. The only exception is the kona nodes, which only have 1 Gbit/s Ethernet.

Additional HPC networks are available between some nodes:

  • OmniPath 100Gbit/s (3 separate networks):

    • bora nodes (and devel[01-02]), also used for BeeGFS storage.

    omnipath-bora.png

    • miriel[01-43] (and devel03).

    omnipath-miriel.png

    • sirocco[07-17] and all kona nodes.
  • Mellanox InfiniBand HDR 200Gbit/s between diablo[01-05].

    infiniband-diablo.png

  • InfiniBand QDR 40Gbit/s between all miriel nodes and sirocco[01-06] (and devel03).

    truescale-miriel.png

    Beware that miriel and devel03 have TrueScale/InfiniPath hardware that requires its own software stack (PSM) for best performance, while the Mellanox InfiniBand in sirocco requires the usual Verbs library. These technologies are compatible, but performance will be suboptimal between miriel and sirocco nodes.

1.2 Standard nodes

1.2.1 bora001-044

CPU

2x 18-core Cascade Lake Intel Xeon Gold 6240 @ 2.6 GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-bora.png

Memory
192 GB (5.3 GB/core) @ 2933 MT/s.
Network

OmniPath 100 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 1 TB (SATA Seagate ST1000NX0443 @ 7.2krpm).

BeeGFS over 100G OmniPath.

1.2.2 miriel001-088

These nodes are provided on a best-effort basis and without support; they will be removed from the platform when they fail to start.

CPU

2x 12-core Haswell Intel Xeon E5-2680 v3 @ 2.5 GHz (Haswell specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-miriel.png

Memory
128 GB (5.3 GB/core) @ 2133 MT/s.
Network

OmniPath 100 Gbit/s on miriel[001-043].

InfiniBand QDR 40 Gbit/s (TrueScale/InfiniPath).

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 300 GB (SATA Seagate ST9500620NS @ 7.2krpm).

BeeGFS over 10G Ethernet.

1.2.3 diablo001-005

CPU

2x 32-core AMD Zen2 EPYC 7452 @ 2.35 GHz on diablo01-04 (CPU specs).

lstopo-diablo01.png

2x 64-core AMD Zen2 EPYC 7702 @ 2 GHz on diablo05 (CPU specs).

lstopo-diablo05.png

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

Memory

256 GB (4 GB/core) @ 2133 MT/s (diablo01-04).

1 TB (8 GB/core) @ 2133 MT/s (diablo05).

Network

Mellanox InfiniBand HDR 200 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 1 TB (SATA Seagate ST1000NM0008-2F2 @ 7.2krpm).

BeeGFS over 10G Ethernet.

1.2.4 zonda01-21

CPU

2x 32-core AMD Zen2 EPYC 7452 @ 2.35 GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-zonda01.png

Memory
256 GB (4 GB/core) @ 3200 MT/s.
Network
10 Gbit/s Ethernet.
Storage
BeeGFS over 10G Ethernet.

1.2.5 arm01

CPU

2x 28-core ARM Cavium ThunderX2 CN9975 v2.1 @ 2.0 GHz (CPU specs).

By default, Turbo-Boost is disabled to ensure the reproducibility of the experiments carried out on the nodes.

However, Hyperthreading is enabled on this node.

lstopo-arm01.png

Memory
256 GB (4.6 GB/core) @ 2666 MT/s.
Network
10 Gbit/s Ethernet.
Storage

Local disk (/tmp) of 128 GB (SATA Seagate ST1000NM0008-2F2 @ 7.2krpm).

BeeGFS over 10G Ethernet.

1.3 Accelerated nodes

1.3.1 sirocco01-05 with 3-4 NVIDIA K40M GPUs

CPU

2x 12-core Haswell Intel Xeon E5-2680 v3 @ 2.5 GHz (Haswell specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco01.png

Memory
128 GB (5.3 GB/core) @ 2133 MT/s.
GPUs

4 NVIDIA K40M (12GB) on sirocco[01-02,05].

3 NVIDIA K40M (12GB) on sirocco[03-04].

Network

Mellanox InfiniBand QDR 40 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 1 TB (SATA Seagate ST91000640NS @ 7.2krpm).

BeeGFS over 10G Ethernet.

View the node's internals

1.3.2 sirocco06 with 2 NVIDIA K40M GPUs

CPU

2x 10-core Ivy-Bridge Intel Xeon E5-2670 v2 @ 2.5 GHz (Ivy Bridge specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco06.png

Memory
128 GB (6.4 GB/core) @ 1866 MT/s.
GPUs
2 NVIDIA K40M (12GB).
Network

Mellanox InfiniBand QDR 40 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 1 TB (SATA Seagate ST1000NM0023 @ 7.2krpm).

BeeGFS over 10G Ethernet.

1.3.3 sirocco07-13 with 2 NVIDIA P100 GPUs

CPU

2x 16-core Broadwell Intel Xeon E5-2683 v4 @ 2.1 GHz (Broadwell specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco07.png

Memory
256 GB (8 GB/core) @ 2133 MT/s.
GPUs
2 NVIDIA P100 (16GB).
Network

OmniPath 100 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 300 GB (SAS WD Ultrastar HUC156030CSS204 @ 15krpm).

BeeGFS over 10G Ethernet.

1.3.4 sirocco14-16 with 2 NVIDIA V100 GPUs and an NVMe disk

CPU

2x 16-core Skylake Intel Xeon Gold 6142 @ 2.6 GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco14.png

Memory
384 GB (12 GB/core) @ 2666 MT/s.
GPUs
2 NVIDIA V100 (16GB).
Network

OmniPath 100 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/scratch) of 750 GB (NVMe Samsung).

BeeGFS over 10G Ethernet.

1.3.5 sirocco17 with 2 NVIDIA V100 GPUs and 1TB memory

CPU

2x 20-core Skylake Intel Xeon Gold 6148 @ 2.4 GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco17.png

Memory
1 TB (25.6 GB/core) @ 1866 MT/s.
GPUs
2 NVIDIA V100 (16GB).
Network

OmniPath 100 Gbit/s.

10 Gbit/s Ethernet.

Storage

Local disk (/tmp) of 1 TB (SAS Seagate ST300MP0026 @ 15krpm).

BeeGFS over 10G Ethernet.

1.3.6 sirocco18-20 with 2 NVIDIA Quadro RTX8000 GPUs

CPU

2x 20-core Cascade Lake Intel Xeon Gold 5218R CPU @ 2.10 GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco18.png

Memory
192 GB (4.8 GB/core) @ 3200 MT/s.
GPUs
2 NVIDIA Quadro RTX8000 (48GB).
Network
10 Gbit/s Ethernet.
Storage

No local storage.

BeeGFS over 10G Ethernet.

1.3.7 sirocco21 with 2 NVIDIA A100 GPUs

CPU

2x 24-core AMD Zen2 EPYC 7402 @ 2.80 GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-sirocco21.png

Memory
512 GB (10.6 GB/core) @ 3200 MT/s.
GPUs
2 NVIDIA A100 (40GB).
Network
10 Gbit/s Ethernet.
Storage

Local disk (/scratch) of 3.5 TB (RAID0 of 2 SAS SSD TOSHIBA KRM5XVUG1T92 Rm5 Mixed Use).

BeeGFS over 10G Ethernet.

1.3.8 kona01-04 Knights Landing Xeon Phi

CPU

64-core Intel Xeon Phi 7230 @ 1.3 GHz, 4 hyperthreads per core (Airmont core specs).

By default, Turbo-Boost is disabled to ensure the reproducibility of the experiments carried out on the nodes.

However, Hyperthreading is enabled on these nodes.

Memory

96 GB of DRAM (1.5 GB/core) @ 2400 MT/s.

16 GB of MCDRAM on-package (0.25 GB/core).

Network

Only 1 Gbit/s Ethernet.

OmniPath 100 Gbit/s.

Storage

Local disk (/scratch) of 800 GB (SSD Intel SSDSC2BX80).

BeeGFS over 1G Ethernet.

KNL configuration

kona01 is in Quadrant/Flat: 64 cores, 2 NUMA nodes for DRAM and MCDRAM.

lstopo-kona01.png

kona02 is in Quadrant/Cache: 64 cores, 1 NUMA node for DRAM with MCDRAM as a cache in front of it.

lstopo-kona02.png

kona03 is in SNC-4/Flat: 4 clusters with 16 cores and 2 NUMA nodes each.

lstopo-kona03.png

kona04 is in SNC-4/Cache: 4 clusters with 16 cores, 1 DRAM NUMA node and MCDRAM as a cache.

lstopo-kona04.png

1.4 Big Memory Nodes

Two nodes are specifically considered as Big Memory nodes: brise and souris, which are described below.

Two other nodes, diablo05 and sirocco17, could also be considered as Big Memory nodes since they have 1 TB of memory, as described above.

1.4.1 brise with 4 sockets, 96 cores and 1 TB memory

CPU

4x 24-core Intel Xeon E7-8890 v4 @ 2.2GHz (CPU specs).

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-brise.png

Memory
1 TB (10.7 GB/core) @ 1600 MT/s.
Network
10 Gbit/s Ethernet.
Storage

Local disk (/tmp) of 280 GB (SAS @ 15krpm).

BeeGFS over 10G Ethernet.

1.4.2 souris (SGI Altix UV2000) with 12 sockets, 96 cores and 3 TB memory

CPU

12x 8-core Intel Ivy-Bridge Xeon E5-4620 v2 @ 2.6 GHz (Ivy Bridge specs)

By default, Turbo-Boost and Hyperthreading are disabled to ensure the reproducibility of the experiments carried out on the nodes.

lstopo-souris.png

Memory
3 TB (32 GB/core) @ 1600 MT/s.
Network
10 Gbit/s Ethernet.
Storage

No local storage.

BeeGFS over 10G Ethernet.

2 Software Documentation

2.1 Operating System

CentOS (Community enterprise Operating System) Release 7.6.1810.

2.2 Slurm

SLURM (Simple Linux Utility for Resource Management) is a scalable open-source scheduler used on a number of world-class clusters.

The currently installed SLURM version on PlaFRIM is 19.05.2.

You will find below a brief description to help users to launch jobs on the platform. More details are available in the official SLURM Quick Start User Guide and in the official SLURM documentation.

2.2.1 Getting Information About Available Nodes

You can see the list of all the nodes in the hardware documentation section.

To see the state of the cluster, go to https://www.plafrim.fr/state/

To allocate a specific category of node with SLURM, you need to specify the node features. To display the list, call the command:

$ sinfo -o "%60f %N"
AVAIL_FEATURES                                               NODELIST
miriel,intel,haswell,omnipath,infinipath                     miriel[001-043]
miriel,intel,haswell,infinipath                              miriel[044-045,048,050-053,056-058,060-064,066-073,075-076,078-079,081,083-088]
amd,zonda                                                    zonda[01-21]
sirocco,intel,broadwell,omnipath,nvidia,tesla,p100           sirocco[07-13]
bora,intel,cascadelake,omnipath                              bora[001-044]
sirocco,intel,haswell,mellanox,nvidia,tesla,k40m             sirocco[01-05]
sirocco,intel,skylake,omnipath,nvidia,tesla,v100             sirocco[14-16]
sirocco,intel,skylake,omnipath,nvidia,tesla,v100,bigmem      sirocco17
arm,cavium,thunderx2                                         arm01
brise,intel,broadwell,bigmem                                 brise
amd,diablo,mellanox                                          diablo[01-04]
amd,diablo,bigmem,mellanox                                   diablo05
kona,intel,knightslanding,knl,omnipath                       kona[01-04]
sirocco,intel,skylake,nvidia,quadro,rtx8000                  sirocco[18-20]
sirocco,amd,nvidia,ampere,a100                               sirocco21
souris,sgi,ivybridge,bigmem                                  souris
visu                                                         visu01
mistral                                                      mistral[02-03,06]

We will see below how to use specific nodes for a job. For example, to reserve a bora node, you need to call

$ salloc -C bora

sinfo has many parameters, for example:

-N (--Node)
Print information in a node-oriented format.
-l (--long)
Print more detailed information.
$ sinfo -l
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
routage*     up 3-00:00:00 1-infinite   no       NO        all     42    drained* miriel[001,005,008,010,016-017,020,022,024,027,043-045,048,050-053,057,060,062-064,067-071,073,075-076,078-079,081,083-088],sirocco[03-04]
routage*     up 3-00:00:00 1-infinite   no       NO        all      1  allocated* bora011
routage*     up 3-00:00:00 1-infinite   no       NO        all      3       down* miriel[019,038,056]
routage*     up 3-00:00:00 1-infinite   no       NO        all      1    draining zonda03
routage*     up 3-00:00:00 1-infinite   no       NO        all      4     drained bora009,miriel004,zonda[01-02]
routage*     up 3-00:00:00 1-infinite   no       NO        all      3       mixed miriel[002-003],sirocco07
routage*     up 3-00:00:00 1-infinite   no       NO        all     17   allocated bora[001-007],diablo[03-04],sirocco[01,14-17],zonda[04-06]
[…]

$ sinfo -N
NODELIST   NODES PARTITION STATE
arm01          1  routage* idle
bora001        1  routage* alloc
bora002        1  routage* alloc
bora003        1  routage* alloc
bora004        1  routage* alloc
bora005        1  routage* alloc
bora006        1  routage* alloc
bora007        1  routage* alloc
bora008        1  routage* idle
bora009        1  routage* drain
[…]

2.2.2 Running Interactive Jobs

salloc allows you to run jobs with several steps, using all or a subset of the allocated resources at each step.

$ salloc -N 3
salloc: Granted job allocation 1155503
salloc: Waiting for resource configuration
salloc: Nodes sirocco[01-03] are ready for job

The command squeue can be used to have a look at the job state:

$ squeue --job 17397
JOBID  PARTITION NAME  USER     ST  TIME  NODES NODELIST(REASON)
17397  routage   bash  bouchoui R   1:05  2     miriel[007-008]

In the same shell terminal, run srun your_executable (the command will use all the allocated resources).

$ srun hostname
sirocco01.plafrim.cluster
sirocco02.plafrim.cluster
sirocco03.plafrim.cluster

$ srun -N 1 hostname
sirocco01.plafrim.cluster

You can connect to the first node using

$ srun --pty bash -i

You can also log in to one of the allocated nodes using ssh; note, however, that the SLURM environment variables will not be set.

$ ssh miriel007

Once connected to a node with ssh, if you want to run a command on all allocated resources, you must run the srun command with the --jobid option followed by the id of your job.

@miriel007~$ srun --jobid=17397 hostname
miriel007
miriel008

The following arguments can also be given to the command srun (an example combining them follows the list):

-N 1 (or --nodes=1)
the node count, by default it is equal to 1.
-n 1 (or --ntasks=1)
the number of tasks; by default it is equal to 1, and it must be less than or equal to the number of cores of the node
--exclusive
allocate node(s) in exclusive mode
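
For example, combining these options within an existing allocation (a minimal sketch; the hostnames shown are illustrative):

$ srun -N 2 -n 4 --exclusive hostname
miriel007.plafrim.cluster
miriel007.plafrim.cluster
miriel008.plafrim.cluster
miriel008.plafrim.cluster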

You can also use srun directly, without a prior salloc.

$ srun --pty bash -i
$ hostname
miriel004.plafrim.cluster

The option --pty also works when asking for more than one node. You will be connected to the first node. To see which nodes are part of the job, you can look at the environment variable SLURM_JOB_NODELIST.
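
For example, from within the job (the node list shown is illustrative):

$ echo $SLURM_JOB_NODELIST
miriel[004-005]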

2.2.3 Running Non-Interactive (Batch) Jobs

$ cat script-slurm.sl
#!/usr/bin/env bash
# Job name
#SBATCH -J TEST_Slurm
# Asking for one node
#SBATCH -N 1
#SBATCH -n 4
# Standard output
#SBATCH -o slurm.sh%j.out
# Standard error
#SBATCH -e slurm.sh%j.err

echo "=====my job information ===="
echo "Node List: " $SLURM_NODELIST
echo "my jobID: " $SLURM_JOB_ID
echo "Partition: " $SLURM_JOB_PARTITION
echo "submit directory:" $SLURM_SUBMIT_DIR
echo "submit host:" $SLURM_SUBMIT_HOST
echo "In the directory:" $PWD
echo "As the user:" $USER

module purge
module load compiler/gcc
srun -n4 hostname

Launch the job using the command sbatch

$ sbatch script-slurm.sl
Submitted batch job 7421

To get information about the running jobs:

$ squeue

and for more details:

$ scontrol show job <jobid>

To delete a running job:

$ scancel <jobid>

To watch the output of the job 17421:

$ cat slurm.sh17421.out
=====my job information ====
Node List:  zonda05
my jobID:  1461714
Partition:  routage
submit directory: /home/furmento
submit host: devel03.plafrim.cluster
In the directory: /home/furmento
As the user: furmento
zonda05.plafrim.cluster
zonda05.plafrim.cluster
zonda05.plafrim.cluster
zonda05.plafrim.cluster

2.2.4 Getting Information About A Job

There are two commands:

  • squeue

    $ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.3C %.20R" --job 17397
    

    The different headers are:

    • JOBID: the job identifier.
    • PARTITION: the partition on which the job is running; use sinfo to display all the partitions of the cluster.
    • NAME: the name of the job; to define or change the name (in batch mode), use -J name_of_job.
    • USER: the login of the job owner.
    • ST: the state of the submitted job (PENDING, RUNNING, FAILED, COMPLETED, etc.).
    • TIME: the time used by the job so far (note: if the user does not define a time limit for the job, the default time limit of the partition will be used).
    • NODES: the number of nodes used.
    • NODELIST: the list of nodes used.

    The different job states are:

    • PD (pending): Job is awaiting resource allocation,
    • R (running): Job currently has an allocation,
    • CA (cancelled): Job was explicitly cancelled by the user or system administrator,
    • CF (configuring): Job has been allocated resources, but is waiting for them to become ready,
    • CG (completing): Job is in the process of completing. Some processes on some nodes may still be active,
    • CD (completed): Job has terminated all processes on all nodes,
    • F (failed): Job terminated with non-zero exit code or other failure condition,
    • TO (timeout): Job terminated upon reaching its time limit,
    • NF (node failure): Job terminated due to failure of one or more allocated nodes.
  • scontrol.

    $ scontrol show job 1155454
    JobId=1155454 JobName=plafrim-master-plafrim-gcc.sl
    UserId=furmento(10193) GroupId=storm(11118) MCS_label=N/A
    Priority=1 Nice=0 Account=(null) QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:06:15 TimeLimit=01:00:00 TimeMin=N/A
    SubmitTime=2021-01-19T08:34:06 EligibleTime=2021-01-19T08:34:06
    AccrueTime=2021-01-19T08:34:06
    StartTime=2021-01-19T08:34:06 EndTime=2021-01-19T09:34:06 Deadline=N/A
    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-19T08:34:06
    Partition=routage AllocNode:Sid=devel02.plafrim.cluster:300429
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=sirocco08
    BatchHost=sirocco08
    NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=24,node=1,billing=24
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
    Features=sirocco DelayBoot=00:00:00
    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
    Command=/home/furmento/buildbot/plafrim-master-plafrim-gcc.sl
    WorkDir=/home/furmento/buildbot
    StdErr=/home/furmento/buildbot/slurm-1155454.out
    StdIn=/dev/null
    StdOut=/home/furmento/buildbot/slurm-1155454.out
    Power=
    

2.2.5 Asking for GPU nodes

The sirocco nodes have GPUs (see the Hardware Documentation section). You will need to specify the given constraints if you want a specific GPU card.

It is advised to use the --exclusive parameter to make sure nodes are not used by another job at the same time.
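
For example, to allocate a node with A100 GPUs in exclusive mode (constraint names as listed in section 2.2.1):

$ salloc --exclusive -C "sirocco&a100"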

$ srun --exclusive -C sirocco --pty bash -i

@sirocco08.plafrim.cluster:~> module load compiler/cuda

@sirocco08.plafrim.cluster:~> nvidia-smi
Tue Jan 19 10:52:07 2021
+-------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
+-------------------------------------------------------------------+
| 0 Tesla P100-PCIE... | On | ...
| 1 Tesla P100-PCIE... | On | ...
[...]

2.2.6 Killing A Job

To kill all running jobs in a batch session, use the scancel command with the user option or with a space-separated list of job ids:

$ scancel -u <user>

or

$ scancel jobid_1 ... jobid_N

The command squeue can be used to get the job ids.

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2545 longq test1 bee R 4:46:27 1 miriel007
2552 longq test2 bee R 4:46:47 1 miriel003
2553 longq test1 bee R 4:46:27 1 miriel004

Interactive jobs can also be killed by exiting the current shell.

2.2.7 Launching multi-prog jobs

It is possible to run a job with several nodes and launch different programs on different sets of nodes.

Here is an example of such a multi-program configuration file.

############################################################
# srun multiple program configuration file
#
# srun -n8 -l --multi-prog silly.conf
############################################################
4-6 hostname
1,7 echo task:%t
0,2-3 echo offset:%o

To submit such a file, use the following command

$ srun -n8 -l --multi-prog silly.conf

You will get an output similar to

4: miriel004.plafrim.cluster
6: miriel004.plafrim.cluster
5: miriel004.plafrim.cluster
7: task:7
1: task:1
2: offset:1
3: offset:2
0: offset:0

2.2.8 Which node(s) do I get by default?

If you do not specify any constraints, SLURM will try first to allocate nodes that do not have any advanced features. The idea is to avoid allocating rare nodes with advanced features (GPUs, large memory, high-speed network, etc) unless really needed.

The current weights for nodes are as follows:

  • zonda
  • miriel
  • bora
  • diablo
  • sirocco
  • arm01
  • visu01
  • kona
  • brise
  • souris

This means zonda nodes are allocated first when possible, while souris is only allocated when explicitly requested or when all other nodes are busy.
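
You can check the scheduling weight of a given node with scontrol; nodes with lower weights are allocated first (the value shown here is illustrative):

$ scontrol show node zonda01 | grep -o "Weight=[0-9]*"
Weight=1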

2.3 Preemption queue

The preempt queue allows jobs to run outside the usual limits, on unused computing resources, without blocking access to these resources for jobs running on the other queues.

This means jobs can be stopped ("preempted") suddenly, at any moment, if a regular job needs the resources.

The job will be restarted when the resources become available again.

The code must therefore regularly back up its state ("checkpoint"), make sure the backup is safe (the job could be stopped while backing up), and be able to restart from a previous backup.

The execution time is limited to 3 days. All nodes are reachable; to limit the execution to some nodes, you can use the node constraints. To list all the available constraints, one can use:

$ sinfo -o "%.100N %.12c %.20R %.90f"|grep preempt
NODELIST CPUS PARTITION AVAIL_FEATURES
miriel[044-045,048,050-053,056-058,060,062-064,067-071,073,075-076,078-079,081,083-088] 24 preempt miriel,intel,haswell,infinipath
miriel[001-006,008-043]                                                                 24 preempt miriel,intel,haswell,omnipath,infinipath
zonda[01-21]                                                                            64 preempt amd,zonda
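
For example, to submit a batch job to the preempt queue and restrict it to zonda nodes (script-slurm.sl being your batch script):

$ sbatch -p preempt -C zonda script-slurm.sl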

2.3.1 Advice on backing up your application state (checkpointing)

  • Use a specific function to back up and another function to restore from a backup.
  • Backup in a temporary file (or several files in a temporary directory), then rename (an atomic operation) the file or directory to its final name to stamp the backup (see the sketch after this list).
    • If the application is stopped during the backup, the temporary backup will be ignored and the previous stamped backup will be used.
  • When your application starts, it should first check if a backup is available and, if so, restart from it.
  • Backing up should typically be done after an MPI barrier to make sure all nodes are synchronized.
  • Backup frequency should be adapted to the backup duration (writing data to disk). As a rough guideline, an application should not run for more than 30 minutes to 1 hour without doing a backup.
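
Below is a minimal shell sketch of the write-then-rename pattern described above; dump_state and restore_state are hypothetical placeholders for your application's own save and restore routines:

# On startup: restart from the last stamped backup, if any.
if [ -f checkpoint.ok ]; then
    restore_state checkpoint.ok
fi

# Periodically: save the state atomically.
dump_state checkpoint.tmp        # write to a temporary file first
sync                             # make sure the data has reached the disk
mv checkpoint.tmp checkpoint.ok  # the atomic rename stamps the backup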

2.4 Modules

2.4.1 Introduction

The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with module files.

Each module file contains the information needed to configure the shell for an application. Once the Modules package is initialized, the environment can be modified on a per-module basis using the module command which interprets module files. Typically module files instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc. Module files may be shared by many users on a system and users may have their own collection to supplement or replace the shared module files.

Modules can be loaded and unloaded dynamically and atomically, in a clean fashion.

2.4.1.1 Example

By default, the compiler gcc is the one installed by the system.

$ type gcc
gcc is /usr/bin/gcc

You can decide to use a specific one installed with a module.

$ module load compiler/gcc/10.1.0
$ type gcc
gcc is /cm/shared/modules/intel/skylake/compiler/gcc/10.1.0/bin/gcc

And decide to switch to another version.

$ module switch compiler/gcc/9.2.0
$ type gcc
gcc is /cm/shared/modules/intel/skylake/compiler/gcc/9.2.0/bin/gcc

and finally to come back to the default system compiler.

$ module unload compiler/gcc/9.2.0
$ type gcc
gcc is /usr/bin/gcc
2.4.1.2 Other module commands
  • module avail → list available modules on the system.
  • module list → list currently loaded modules.
  • module purge → unload all loaded modules.

2.4.2 Module Naming Policy

To ease module management, modules are sorted according to the architecture of the nodes, and grouped in categories.

The different architectures are:

  • generic for modules which can run on all nodes
  • intel/haswell for the nodes miriel and sirocco[01-05]
  • intel/broadwell for the nodes sirocco[07-13]
  • intel/skylake for the nodes bora
  • intel/knightslanding for the nodes kona

When connecting to a node via salloc, the environment variable MODULEPATH contains the generic directory and the node-specific manufacturer/chip directory.
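
For example, on a bora node (the exact paths below are illustrative):

$ echo $MODULEPATH
/cm/shared/modules/generic/modulefiles:/cm/shared/modules/intel/skylake/modulefiles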

Within each architecture, modules are grouped with the following module naming policy

/category/module/option/version

the number of options ranging from 0 to as many as needed.

For example

  • partitioning/scotch/int32/6.0.4
  • partitioning/scotch/int64/6.0.4

2.4.3 Dev and Users Modules

  • Modules managed by the technical team (MPI, GCC and CUDA compilers) are available in /cm/shared/modules
  • User-managed modules are available in /cm/shared/dev/modules
  • All users can install modules; one only needs to be added to the Unix group plafrim-dev (open a ticket to plafrim-support AT inria.fr)

2.4.4 How to Create a Module

2.4.4.1 File System
  • Module files go in /cm/shared/dev/modules/architecture/modulefiles, following the architecture and naming policies.
  • Application installation files go in /cm/shared/dev/modules/architecture/apps, with the same architecture and naming policies.
  • For example, let's install for all nodes the version 1.1.8 of the trace generator application eztrace
    • Install your application in /cm/shared/dev/modules/generic/apps/trace/eztrace/1.1.8/
    • Create the module file /cm/shared/dev/modules/generic/modulefiles/trace/eztrace/1.1.8
proc ModulesHelp { } {
  puts stderr "\tAdds EzTrace 1.1.8 to your environment variables"
}

module-whatis "adds eztrace 1.1.8 trace generator tool to your environment variables"

set             version         1.1.8
set             prefix          /cm/shared/dev/modules/generic/apps/trace/eztrace
set             root            $prefix/$version

#path added in the beginning
prepend-path   CPATH  $root/include
prepend-path   LIBRARY_PATH  $root/lib
prepend-path   LD_LIBRARY_PATH  $root/lib
  • Set the correct permissions
$ module load tools/module_cat
$ module_perm /cm/shared/dev/modules/generic/apps/trace/eztrace/1.1.8
$ module_perm /cm/shared/dev/modules/generic/modulefiles/trace/eztrace/1.1.8
2.4.4.2 Dependencies

Your module may depend on other modules, and may have been compiled against different versions of those modules. There are two possible solutions.

  • First solution

    • Define a single module file /cm/shared/dev/modules/generic/modulefiles/perftools/simgrid/3.24
    • specify the dependency(ies)
    prereq compiler/gcc
    
    $ module show compiler/gcc
    ...
    setenv GCC_VER 9.3.0
    ...
    
    • and use their information
    set prefix /cm/shared/dev/modules/generic/apps/perftools/simgrid/$version/install/gcc_$env(GCC_VER)
    
  • Second solution

    • Define two different module files
      • /cm/shared/dev/modules/generic/modulefiles/perftools/simgrid/3.24/gcc_8.2.0
      • /cm/shared/dev/modules/generic/modulefiles/perftools/simgrid/3.24/gcc_9.2.0
    • To avoid loading 2 versions of the same module
    conflict perftools/simgrid
    

2.4.5 What are the useful variables?

Path to development headers (for compiling)

prepend-path CPATH ...
prepend-path FPATH ...
prepend-path INCLUDE ...
prepend-path C_INCLUDE_PATH ...
prepend-path CPLUS_INCLUDE_PATH ...
prepend-path OBJC_INCLUDE_PATH ...

Path to libraries (for linking)

prepend-path LIBRARY_PATH ...

Path to tools and libraries (for running)

prepend-path PATH ...
prepend-path LD_LIBRARY_PATH ...

pkg-config to simplifying build systems (point to directories with .pc files)

prepend-path PKG_CONFIG_PATH $prefix/lib/pkgconfig

Manpages

append-path  MANPATH   $man_path
append-path  MANPATH   $man_path/man1
append-path  MANPATH   $man_path/man3
append-path  MANPATH   $man_path/man7

Some modules define specific variables, likely because some project once decided it needed them...

setenv            CUDA_INSTALL_PATH   $root
setenv            CUDA_PATH           $root
setenv            CUDA_SDK            $root
prepend-path      CUDA_INC_PATH       $root/include

setenv          HWLOC_HOME          $prefix

setenv         MPI_HOME          $prefix
setenv         MPI_RUN           $prefix/bin/mpirun
setenv         MPI_NAME          $name
setenv         MPI_VER           $version

2.4.6 Which version gets selected by default during load?

By default, the highest version (the last one in alphabetical order) is selected:

$ module avail formal/sage
--- /cm/shared/dev/modules/generic/modulefiles ---
formal/sage/7.0 formal/sage/8.9 formal/sage/9.0
$ module load formal/sage
$ module list
Currently Loaded Modulefiles:
1) formal/sage/9.0

Another default version may be enforced. This is useful if your latest beta release isn't stable or backward compatible but is still needed by some hardcore users.

$ cat /cm/shared/dev/modules/generic/modulefiles/hardware/hwloc/.version
#%Module1.0#
set ModulesVersion "2.1.0"

2.4.7 More on environment-dependent modules

$ module_grep starpu
runtime/starpu/1.3.2/mpi
runtime/starpu/1.3.2/mpi-fxt
runtime/starpu/1.3.3/mpi
runtime/starpu/1.3.3/mpi-cuda
runtime/starpu/1.3.3/mpi-cuda-fxt
runtime/starpu/1.3.3/mpi-fxt

StarPU has different module files which ONLY differ in the prereq commands and the prefix setting.

> prereq compiler/cuda/10.1
> prereq trace/fxt/0.3.9
< set     prefix          /cm/shared/dev/modules/generic/apps/runtime/starpu/1.3.3/gcc@9.2.0-hwloc@2.1.0-openmpi@4.0.2
> set     prefix          /cm/shared/dev/modules/generic/apps/runtime/starpu/1.3.3/gcc@8.2.0-hwloc@2.1.0-openmpi@4.0.1-cuda@10.1
> set     prefix          /cm/shared/dev/modules/generic/apps/runtime/starpu/1.3.3/gcc@8.2.0-hwloc@2.1.0-openmpi@4.0.1-cuda@10.1-fxt@0.3.9

Create a module file covering all the cases:

if {![ is-loaded compiler/gcc ]} {
   module load compiler/gcc
}
if {![ is-loaded hardware/hwloc ]} {
   module load hardware/hwloc/2.1.0
}

conflict runtime/starpu

set cuda ""
set fxt ""
if {[ is-loaded compiler/cuda/10.1 ]} {
  set cuda -cuda@10.1
}
if {[ is-loaded trace/fxt/0.3.9 ]} {
  set fxt -fxt@0.3.9
}

set     name            starpu
set     version         1.3.3

set     prefix /cm/shared/dev/modules/generic/apps/runtime/$name/$version/gcc@$env(GCC_VER)-hwloc@2.1.0-openmpi@4.0.1${cuda}${fxt}

Different environments lead to a different version of StarPU being used.

2.4.7.1 1st case
$ module purge
$ module load runtime/starpu/42
$ module list
Currently Loaded Modulefiles:
1) compiler/gcc/9.2.0 2) hardware/hwloc/2.1.0 3) runtime/starpu/42
$ module show runtime/starpu/42
setenv STARPU_DIR /cm/shared/dev/modules/generic/apps/runtime/starpu/1.3.3/gcc@9.2.0-hwloc@2.1.0-openmpi@4.0.1
2.4.7.2 2nd case
$ module purge
$ module load compiler/gcc/8.2.0 compiler/cuda/10.1 runtime/starpu/42
$ module show runtime/starpu/42
setenv STARPU_DIR /cm/shared/dev/modules/generic/apps/runtime/starpu/1.3.3/gcc@8.2.0-hwloc@2.1.0-openmpi@4.0.1-cuda@10.1
2.4.7.3 3rd case
$ module purge
$ module load compiler/gcc/8.2.0 compiler/cuda/10.1 trace/fxt runtime/starpu/42
$ module show runtime/starpu/42
setenv STARPU_DIR /cm/shared/dev/modules/generic/apps/runtime/starpu/1.3.3/gcc@8.2.0-hwloc@2.1.0-openmpi@4.0.1-cuda@10.1-fxt@0.3.9

2.4.8 Module tools/module_cat

The module tools/module_cat provides the following tools:

  • module_list to list the existing categories
  • module_init, module_add, module_rm to modify the environment variable MODULEPATH, which defines the folders in which to look for modules
  • module_perm to set the correct permissions on a given directory
  • module_search to search for all modules whose names contain the given string, e.g. module_search hwloc

2.5 Parallel Programming (MPI)

MPI usage depends on the implementation of MPI being used (for more details, see http://slurm.schedmd.com/mpi_guide.html).

We describe below how to use OpenMPI and Intel MPI, which are both installed on PlaFRIM.

2.5.1 OpenMPI

Load your environment using the appropriate modules.

Currently, we provide the following OpenMPI versions:

$ module avail mpi/openmpi
mpi/openmpi/2.0.4 mpi/openmpi/3.1.4 mpi/openmpi/4.0.1
mpi/openmpi/4.0.1-intel mpi/openmpi/4.0.2 mpi/openmpi/4.0.2-testing
mpi/openmpi/4.0.3 mpi/openmpi/4.0.3-mlx

To use the 4.0.3 version

$ module load mpi/openmpi/4.0.3

To run an MPI program, you can use mpirun.

$ salloc -N 3

$ mpirun hostname
miriel040.plafrim.cluster
miriel041.plafrim.cluster
miriel042.plafrim.cluster

$ salloc -n 3

$ mpirun hostname
miriel023.plafrim.cluster
miriel023.plafrim.cluster
miriel023.plafrim.cluster

To compile an MPI application, you can use mpicc:

mpicc -o program program.c

To run the program

mpirun --mca btl openib,self program

To launch MPI applications on miriel, sirocco and devel nodes:

$ mpirun -np <nb_procs> --mca mtl psm  ./apps

If you need the OmniPath interconnect:

$ mpirun -np <nb_procs> --mca mtl psm2 ./apps

2.5.2 Intel MPI

Load your environment using the appropriate modules.

$ module avail mpi/intel
[...]
$ module add compiler/gcc compiler/intel mpi/intel

Create a file with the names of the machines that you want to run your job on:

$ srun hostname -s| sort -u > mpd.hosts

To run your application on these nodes, use mpiexec.hydra, and choose the fabrics for intra-node and inter-node MPI communication:

$ export I_MPI_FABRICS=shm:tmi
$ mpiexec.hydra -f mpd.hosts -n $SLURM_NNODES ./a.out

Select the particular network fabrics to be used with the environment variable I_MPI_FABRICS.

I_MPI_FABRICS=<fabric>|<intra-node fabric>:<inter-node fabric>

Where <fabric> := {shm, dapl, tcp, tmi, ofa}

For example, to select the shared memory fabric (shm) for intra-node MPI communication, and the tag matching interface fabric (tmi) for inter-node MPI communication, use the following commands:

$ export I_MPI_FABRICS=shm:tmi
$ mpiexec.hydra -f mpd.hosts -n $SLURM_NPROCS ./a.out

The available fabrics on the platform are:

tmi
TMI-capable network fabrics, including Intel True Scale Fabric and Myrinet (through the Tag Matching Interface)
ofa
OFA-capable network fabric including InfiniBand (through OFED verbs)
dapl
DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
tcp
TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)

You can also specify a list of fabrics with the environment variable I_MPI_FABRICS_LIST (the default value is dapl,tcp). The first fabric detected will be used at runtime:

I_MPI_FABRICS_LIST=<fabrics list>

Where <fabrics list> := <fabric>,…,<fabric>

(for more details visit https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Communication_Fabrics_Control.htm)

2.6 Software management with GNU Guix

In addition to module, PlaFRIM users can manage software environments using GNU Guix, a general-purpose package manager.

2.6.1 Why use Guix?

Guix can be used in addition to and in parallel with module. There are several reasons why it might be useful to you:

  • Guix provides more than 7,500 software packages including: utilities such as tmux, compiler toolchains (GCC, Clang), Python software (Scikit-Learn, NumPy, etc.), HPC libraries (Open MPI, MUMPS, PETSc, etc.).
  • Pre-built binaries are usually available for packages you install, which makes installation fast.
  • You get to choose when you upgrade or remove packages you've installed for yourself, and can roll back any time you want should an upgrade go wrong.
  • You can reproduce the exact same software environment, bit-for-bit, on PlaFRIM and on other machines (laptop, cluster, etc.)
  • Software environments can be "packed" as a Docker image for use on other systems.

2.6.2 Getting Started

2.6.2.1 Looking for packages

You can browse the on-line package list or use one of these commands:

$ guix package --list-available
$ guix package -s <keyword>
2.6.2.2 Installing Software

By default Guix installs software in your home directory, under ~/.guix-profile. On PlaFRIM, installing software with Guix automatically updates your environment variables such that, on your next login, PATH, PYTHONPATH, and similar variables point to ~/.guix-profile.

  • To install the latest GNU compilation toolchain, run:

    $ guix package --install gcc-toolchain
    
  • To install Python 3.x along with NumPy and SciPy (note: the command is called python3, not python), run:

    $ guix package -i python python-numpy python-scipy
    
  • Setting search path environment variables:

    $ eval `guix package --search-paths=prefix`
    
  • Updating the package set:

    $ guix pull
    
2.6.2.3 Dealing with "Profile Generations"
  • To list your "profile generations" (i.e., the successive changes to your set of installed packages):

    $ guix package -l
    
  • To roll back to a previous generation of your "profile":

    $ guix package --roll-back
    

2.6.3 Using the Guix-HPC Packages

We maintain a package collection for software developed by Inria research teams such as STORM, HiePACS, and TaDaaM in the Guix-HPC repository. To use it, run:

$ git clone https://gitlab.inria.fr/guix-hpc/guix-hpc.git
$ export GUIX_PACKAGE_PATH=$PWD/guix-hpc/modules

Non-free software such as CUDA, as well as variants of free software packages with dependencies on non-free software (such as starpu-cuda) are available separately (requires a gitlab.inria.fr account):

$ git clone https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git
$ export GUIX_PACKAGE_PATH=$PWD/guix-hpc-non-free/modules

2.6.4 Creating Portable Bundles

Once you have a software environment that works well on PlaFRIM, you may want to create a self-contained "bundle" that you can send to and use on other machines that do not have Guix. With guix pack you can create "container images" for Docker or Singularity, or even standalone tarballs; see the guix pack documentation for more information.
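
For example, to pack an environment as a Docker image or as a Singularity (SquashFS) image (a sketch reusing the packages installed above):

$ guix pack -f docker python python-numpy python-scipy
$ guix pack -f squashfs bash python python-numpy python-scipy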

2.6.5 Support

Please send any support request to plafrim-guix@inria.fr.

2.7 3D Visualization with VirtualGL and TurboVNC

  • Install and set up the TurboVNC Viewer on your desktop
  • Connect to PlaFRIM:

    $ module load slurm visu/srun
    $ srun-visu
    
  • The first time, TurboVNC will ask you for a password to secure the X11 session
  • Wait for a result like:

    Waiting for a slot on a visualization server
    
    Desktop 'TurboVNC: visu01:1 (login)' started on display visu01:1
    
    Starting applications specified in /home/login/.vnc/xstartup.turbovnc
    
    Log file is /home/login/.vnc/visu01:1.log
    
    Launched vncserver: visu01:1
    
    Now, in another terminal, open a new SSH session to plafrim like this: "ssh plafrim -N -L 5901:visu01:5901 &" and launch TurboVNC viewer (vncviewer command) on your desktop on "localhost:1"
    
  • Now open another SSH session using the suggested command.
  • In the Applications menu, you will find the VisIt and ParaView software.
  • In order to use a 3D (OpenGL) program via the CLI, put vglrun before the command, like: vglrun paraview
  • Halting srun-visu (via Ctrl-C or scancel) or closing the first SSH session will stop the post-processing session.
  • On your desktop, run the TurboVNC vncviewer.
  • vncviewer will ask you for the DISPLAY value (from the previous command) and the session password.
  • Important: The default session time is limited to two hours. To specify a different session time, use:

    $ srun-visu --time=HH:MM:SS
    
  • A session time cannot exceed eight hours (--time=08:00:00)

2.8 IRODS Storage Resource

2.8.1 Introduction

An iRODS Storage Resource is available at the MCIA (mésocentre Aquitain).

It allows you to back up your research data.

2.8.2 Information

IMPORTANT: Encryption is not available (apart from authentication). Data are unencrypted both on the disks and on the network. If necessary, you need to encrypt your data yourself.

Data are scattered over 7 sites (Bordeaux and Pau).

iRODS keeps 3 copies of every file:

  • one near the storage resource the data was first copied to,
  • one at the MCIA (near Avakas),
  • one in another storage resource.

Default quota: 500 GB.

Help: https://redmine.mcia.univ.bordeaux.fr/projects/irods

2.8.3 How to use the system

One needs an account at the mésocentre. To do so, request an Avakas account at inscriptions@mcia.univ-bordeaux.fr

  • Connect to the mésocentre to initialize your iRODS account, and choose a specific password for iRODS. When loading the module, the iRODS account will be initialized.

    $ ssh VOTRELOGIN@avakas.mcia.univ-bordeaux.fr
    $ module load irods/mcia
    
  • On PlaFRIM, prepare your environment by calling the command iinit and answer the questions.

    $ module load tools/irods
    $ iinit
    
    • Host: icat0.mcia.univ-bordeaux.fr
    • Port: 1247
    • Zone: MCI
    • Default Resource: siterg-imb (for PlaFRIM; from another platform, choose siterg-ubx)
    • Password: ...

The iRODS password can only be changed from Avakas:

$ module load irods/mcia
$ mcia-irods-password

2.8.4 Basic commands

The list of all available commands can be found here.

2.8.4.1 FTP
  • icd [irods_path] (change the working directory)
  • imkdir irods_path (create a directory)
  • ils [irods_path] (list directory contents)
  • iput local_file [irods_path] (upload a file)
  • iget irods_file [local_path] (download a file)
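
For example, a short upload session (file names are illustrative):

$ module load tools/irods
$ imkdir backup
$ iput results.tar.gz backup
$ ils backup
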
2.8.4.2 rsync
  • irsync local_path irods_path
    • iRODS paths start with "i:": i:foo/bar
    • irsync foo i:bar/zzz (upload)
    • irsync i:bar/zzz foo (download)

3 FAQ

3.1 Citing PlaFRIM in your publications and in the HAL Open Archive

Don't forget to cite PlaFRIM in all publications presenting results or content obtained or derived from the usage of PlaFRIM.

Here's what to insert in the acknowledgments section of your paper:

Experiments presented in this paper were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d’Aquitaine (see https://www.plafrim.fr).

When you deposit a publication in the HAL Open Archive, please add plafrim in the Project/Collaboration field of the metadata.

You may also check the current list of publications registered in HAL.

3.2 Access and Storage

3.2.1 Connecting to PlaFRIM

Your ssh must be configured properly to access PlaFRIM. Details are available in the private page How to connect? (requires sign-in) as well as in the email you received when your account was created.

3.2.2 Storage

3.2.2.1 Six storage spaces with different purposes
/home/<LOGIN>

Max size: 20 GB

Deletion: Never

Hardware Protection (RAID): Yes

Backup: Regular + versioning

Primary use: individual

How to obtain: automatic

Quota usage command: quota -f /home

/projets/<PROJET>

Size: 200 GB

Deletion: Never

Hardware Protection (RAID): Yes

Backup: Regular + versioning

Primary use: group. This space is a storage space that can be shared between several users to deposit data, software, etc.

How to obtain: on demand. To obtain such a space, simply send an email to PlaFRIM Support, specifying the name and description of the project, with the list of people connected to this project.

Quota usage command: du -s /projets/<PROJET>

DEPRECATED /lustre/<LOGIN>

Max size: 1 TB

Deletion: Never

Hardware Protection (RAID): Yes

Backup: No

Primary use: individual

How to obtain: automatic

Quota usage command: lfs quota -u <LOGIN> /lustre

/beegfs/<LOGIN>

Max size: 1 TB

Deletion: Never

Hardware Protection (RAID): Yes

Backup: No

Primary use: individual

How to obtain: automatic

Quota usage command: beegfs-ctl --getquota --uid <LOGIN>

/tmp

Max size: variable

Deletion: If needed and when restarting machines

Hardware Protection (RAID): No

Backup: No

Primary use: individual

How to obtain: automatic

/scratch

Max size: variable

Deletion: If needed and when restarting machines

Hardware Protection (RAID): No

Backup: No

Primary use: individual

How to obtain: automatic. This space is only available on sirocco[14,15,16,21].

3.2.2.2 Restoring lost files

Each directory under /home/<LOGIN> or /projets/<PROJET> has a .snapshot directory in which you can retrieve lost files.

Only the /home and /projets directories have snapshots activated; they are replicated off-site for 4 weeks.
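
For example, to restore a lost file (the snapshot subdirectory name depends on the snapshot date; the path below is illustrative):

$ cp /home/<LOGIN>/.snapshot/daily_2021-11-24/myfile /home/<LOGIN>/myfile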

3.2.3 Accessing an external site from PlaFRIM

Users must send a ticket to plafrim-support stating the site they need to access and the reason.

The technical team will approve the request after checking it does not lead to any technical issues (security…).

3.3 Continuous integration

3.3.1 May I run a Continuous Integration (CI) daemon on PlaFRIM?

The feature is under testing.

Please do not run such daemons (gitlab runners, jenkins slaves, etc.) in your home account.

A public announcement will be made when the feature and procedure are implemented.

3.4 Misc

3.4.1 Getting Help

For community sharing, basic questions, etc.
contact plafrim-users or use the Mattermost server.
To exchange more widely about the platform usage
contact your representative in the user committee.
For technical problems (access, account, administration, modules, etc.)
open a ticket by contacting plafrim-support.

For more details and links, see the private support page (requires login). Links to these resources are also given in the welcome message when you open an SSH connection to the PlaFRIM front-end nodes.

3.4.2 Changing my password for the website plafrim.fr

This is the usual WordPress password change procedure.

  • Go to http://www.plafrim.fr/wp-login.php
  • Click on “Lost your password?”
  • Enter either your PlaFRIM username (your SSH login) or the email address you used to create your PlaFRIM account, and click on “Get New Password”
  • Check your email inbox
  • Click on the link (the longer one) proposed within this email
  • Choose your new password
  • Test that you can connect at http://www.plafrim.fr/connection/

3.4.3 Improving this documentation
