2023-09-21

Segmentation Fault When Sending Arrays Over a Certain Size with Open MPI

I am writing a program to run on an InfiniBand, Intel-based cluster using Open MPI, PMIx, and Slurm for scheduling.

When I run my program on the cluster with an input matrix over 38x38 on each node, I get a segfault on both send/recv and collective calls. Below 38x38 per node, there are no issues. The code also works on a single node and with Intel MPI; the segfault only occurs when using multiple nodes with Open MPI.

Here is a minimal sample code that reproduces my error:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int proc_size, p_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &proc_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &p_rank);

    int x = 1600;

    MPI_Status status;
    double* A = calloc(x, sizeof(double));

    if (p_rank == 0) {
        /* Rank 0 receives a buffer of x doubles from every other rank. */
        for (int i = 1; i < proc_size; ++i) {
            MPI_Recv(A, x, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
        }
    } else {
        /* All other ranks send their buffer to rank 0. */
        MPI_Send(A, x, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(A);
    MPI_Finalize();

    return 0;
}
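
For reference, it can be built with the standard Open MPI compiler wrapper, e.g. (the source-file name repro.c is just a placeholder for my actual file):

mpicc -O2 repro.c -o my_program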

I am using srun --mpi=pmix -n 36 my_program in my sbatch job, but the segfault also occurs when using mpirun.
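
A stripped-down version of my batch script looks roughly like this (the partition name and node count are placeholders, not my exact values):

#!/bin/bash
#SBATCH --job-name=mpi_repro
#SBATCH --partition=compute
#SBATCH --nodes=9
#SBATCH --ntasks=36

srun --mpi=pmix -n 36 my_program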

When x is below 1500, it produces no errors. When x is above 1500, I get something similar to the following. (1500 doubles is 12,000 bytes, and 38 x 38 = 1444 doubles is about 11.5 KB, so the cutoff seems to be around 12 KB per message.)

[node30:14535] *** Process received signal ***
[node42:144621] *** Process received signal ***
[node42:144621] Signal: Segmentation fault (11)
[node42:144621] Signal code: Address not mapped (1)
[node42:144621] Failing at address: 0x7fcc6fdcc210
[node30:14535] Signal: Segmentation fault (11)
[node30:14535] Signal code: Address not mapped (1)
[node30:14535] Failing at address: 0x7fe9fe17d210
[node19:91882] *** Process received signal ***
[node19:91882] Signal: Segmentation fault (11)
[node19:91882] Signal code: Address not mapped (1)
[node19:91882] Failing at address: 0x7fb02739d210
srun: error: node30: task 4: Segmentation fault
srun: error: node42: task 7: Segmentation fault

I should also note that the program appears to complete successfully despite the error, but it blocks the next run from starting.


