7. Parallelized version

7.1. General remarks

The parallelized version of Spex uses the MPI standard 3.1. It can be run on several CPUs of the same node or on several nodes using mpirun, mpiexec, or whatever MPI launcher your computer system provides. In principle, there are no restrictions on the number of processes. Better performance is expected, though, if the number of processes is not a prime number but has a long prime factorization, because this gives the code more freedom in distributing the work among the processes.
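
As an illustration, a single-node launch could look as follows. This is only a sketch: the executable name spex and the process count are placeholders for whatever your installation and batch system use.

    # 12 = 2*2*3 has a longer prime factorization than, e.g., 13 (prime),
    # giving Spex more freedom to distribute the work among the processes
    mpirun -np 12 spex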

The default parallelization strategy of Spex is conservative in the sense that memory demand and load imbalances are minimized. Often, a parallel run can be sped up substantially by using MPIKPT or MPIBLK, see below.

7.2. Special MPI keywords

7.2.1. MPIKPT

In many calculation types (GW, Hubbard U calculations, COHSEX, …), there is an outer loop over the k-point set. By default, Spex does not parallelize over this loop, because different k points need different computation times depending on their symmetry, which makes the work distribution non-trivial. However, if there are many k points, it is recommended to additionally parallelize over this loop. This can be enabled with the keyword MPIKPT. The k-loop parallelization is over nodes, not over processes: the computation for each individual k point runs in parallel over the processes of the respective node, in the same way as all processes would work together without MPIKPT. If only a single node (or very few nodes) is available, MPIKPT can still be used in conjunction with MPISPLIT, which allows processes to be grouped into virtual nodes; see the sketch below.
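
As a sketch (the process counts are placeholders, and the usual # comment syntax of spex.inp is assumed), a run on a single node could combine the two keywords in spex.inp,

    MPIKPT            # additionally distribute the k loop over the (virtual) nodes
    MPISPLIT SHRD=8   # group the ranks into virtual nodes of 8 processes each

and be launched with, e.g., mpirun -np 32 spex, so that the k loop is distributed over four virtual nodes of eight processes each.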

7.2.2. MPIBLK (SENERGY)

Another special parallelization layer is the parallelization over blocks of the self-energy matrix (or over its diagonal elements). This may speed up the calculation if there are many blocks, but it may also result in work imbalance, because different blocks need different computation times. Parallelization over blocks is enabled with the keyword MPIBLK. An optional argument, e.g., MPIBLK 5, can be used to fine-tune the work distribution; it gives the “relative computational overhead” of each block that does not scale with the number of bands. The default value is 10. MPIBLK is enabled automatically except for GW FULL calculations, because the different sizes of the self-energy blocks in GW FULL may lead to work imbalances. Note that MPIBLK increases the memory demand.

Examples
MPIBLK      Enable parallelization over self-energy blocks (or diagonal elements).
MPIBLK 50   Enable parallelization with an assumed large “computational overhead” per block.
MPIBLK 0    Disable parallelization over blocks.
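
For instance, since MPIBLK is not switched on automatically for GW FULL, it can be requested explicitly in spex.inp. The following is only a sketch; the JOB line is a placeholder for the actual job definition.

    JOB GW FULL ...   # placeholder: your actual GW FULL job definition
    MPIBLK 50         # parallelize over the self-energy blocks despite their different sizes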

7.2.3. MPISPLIT

(*) The shared-memory functionality of MPI 3.1 is used for several big arrays, which allows the same memory region to be accessed by several MPI processes. By default, all processes running on one node share the memory. It can be reasonable to change this behavior so that, for example, only processes on the same socket or in the same NUMA domain share memory. This is possible with MPISPLIT NODE (default), MPISPLIT SOCKET (only works with OpenMPI), or MPISPLIT SHRD=16, where, in this example, groups of 16 processes share memory: the ranks 0-15, 16-31, etc. Using this option increases the memory consumption but might be advantageous in terms of memory bandwidth and computation time.
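
Examples
MPISPLIT NODE      Processes on the same node share memory (default).
MPISPLIT SOCKET    Processes on the same socket share memory (only works with OpenMPI).
MPISPLIT SHRD=16   Groups of 16 consecutive ranks (0-15, 16-31, etc.) share memory.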

7.2.4. MPISYM (SENERGY)

(*) Using Padé approximants in the evaluation of the GW self-energy (CONTINUE, CONTOUR with a Padé approximant for W, or FREQINT PADE) might lead to a slight symmetry breaking in the quasiparticle energies and thus to an unphysical lifting of degeneracies. (This is caused by the fact that Thiele’s continued-fraction Padé formula is numerically unstable, especially for a large number of imaginary frequencies.) Since these errors are usually very small, this is not a big problem. Furthermore, when the full self-energy matrix is calculated (e.g., GW FULL), Spex performs a symmetrization of the self-energy matrix, which enforces the correct degeneracies again. However, for testing purposes, it is possible to enforce the correct symmetries already in the evaluation of the self-energy by using the keyword MPISYM. This requires additional communication among the processes and can therefore slow down the calculation because of the blocking synchronization involved.
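
As a sketch of such a test (the JOB line is only a placeholder for the actual job definition, and the usual # comment syntax of spex.inp is assumed):

    JOB GW ...        # placeholder: your actual GW job definition
    FREQINT PADE      # Padé-based frequency integration, which may slightly break degeneracies
    MPISYM            # enforce the correct symmetries already during the self-energy evaluation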

7.2.5. RESTART

In the parallelized version, the RESTART option works in exactly the same way as in the serial version (see Section 4.1.7, Section 5.1.13, and Section 5.5.3). However, the restart data might be written to separate files when MPIKPT is used. (The underlying reason is that binary or HDF5 files can be written in parallel, i.e., by all processes at the same time, only if the dataset sizes are known in advance, which is not the case for the restart data.) Instead of a single file “spex.cor”, Spex writes the files “spex.cor.1”, “spex.cor.2”, et cetera, and a directory “spex.cor.map”, which contains, for each k point, a link to the cor file that holds the respective data. Furthermore, in addition to “spex.sigc” (“spex.sigx”, “spex.wcou”, “spex.ccou”, “spex.core”), the files “spex.sigc.2”, “spex.sigc.3”, et cetera, might be written (and analogously for the other file names). These multiple files have to be taken into account when restart files are transferred. Switching between different numbers of processes, different numbers of nodes, or between serial and parallel runs should not lead to problems; Spex should always be able to read the correct data.
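
For example, when moving the restart data of an MPIKPT run to another directory, all numbered files and the “spex.cor.map” directory have to be copied along. The following is only a sketch for a Unix shell; the target path is a placeholder, and the other file families (“spex.sigx”, “spex.wcou”, etc.) would be included analogously if they exist for your calculation type.

    # spex.cor* matches the numbered cor files as well as the spex.cor.map directory
    cp -r spex.cor* spex.sigc* /path/to/new/run/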

Note

Paragraphs discussing advanced options are marked with (*), and those about obsolete, unmaintained, or experimental options with (**). You can safely skip the paragraphs marked with (*) and (**) at first reading.