Segmentation Fault When Sending Arrays Over a Certain Size with Open MPI

I am writing a program to run on an Intel-based cluster with InfiniBand, using Open MPI, PMIx, and SLURM scheduling.

When I run my program on the cluster with an input matrix over 38x38 on each node, I get a segfault on both send/recv and collective calls. Below 38x38 on each node, there are no issues. The code also works on a single node, and it works with Intel MPI. The segfault only occurs when using multiple nodes and Open MPI.

Here is a minimal sample code that reproduces my error:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int proc_size, p_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &proc_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &p_rank);

    int x = 1600;  /* number of doubles per message */

    MPI_Status status;
    double* A = calloc(x, sizeof(double));

    if (p_rank == 0) {
        /* Rank 0 receives one message from every other rank, in rank order. */
        for (int i = 1; i < proc_size; ++i) {
            MPI_Recv(A, x, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
        }
    } else {
        /* Every other rank sends its buffer to rank 0. */
        MPI_Send(A, x, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(A);
    MPI_Finalize();

    return 0;
}

I am using srun --mpi=pmix -n 36 my_program in my sbatch job, but the segfault also occurs when using mpirun.

When x is below 1500, it produces no errors. When x is above 1500 (as in the sample, where x = 1600), I get something similar to the following:

[node30:14535] *** Process received signal ***
[node42:144621] *** Process received signal ***
[node42:144621] Signal: Segmentation fault (11)
[node42:144621] Signal code: Address not mapped (1)
[node42:144621] Failing at address: 0x7fcc6fdcc210
[node30:14535] Signal: Segmentation fault (11)
[node30:14535] Signal code: Address not mapped (1)
[node30:14535] Failing at address: 0x7fe9fe17d210
[node19:91882] *** Process received signal ***
[node19:91882] Signal: Segmentation fault (11)
[node19:91882] Signal code: Address not mapped (1)
[node19:91882] Failing at address: 0x7fb02739d210
srun: error: node30: task 4: Segmentation fault
srun: error: node42: task 7: Segmentation fault

I should also note that the program completes successfully despite the error, but blocks the next run from starting.


