Segmentation Fault When Sending Arrays Over a Certain Size with Open MPI
I am writing a program to run on an InfiniBand, Intel-based cluster using Open MPI, PMIx, and Slurm scheduling.
When I run my program on the cluster with an input matrix larger than 38x38 on each node, I get a segfault on both send/recv and collective calls; below 38x38 per node there are no issues. The code also works on a single node and with Intel MPI. The segfault only occurs when using multiple nodes with Open MPI.
Here is a minimal sample code that reproduces my error:
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int proc_size, p_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &proc_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &p_rank);

    int x = 1600;
    MPI_Status status;
    double* A = calloc(x, sizeof(double));

    if (p_rank == 0) {
        /* rank 0 receives one message from every other rank */
        for (int i = 1; i < proc_size; ++i) {
            MPI_Recv(A, x, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
        }
    } else {
        /* every other rank sends its buffer to rank 0 */
        MPI_Send(A, x, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(A);
    MPI_Finalize();
    return 0;
}
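As mentioned above, collective calls hit the same size threshold. A minimal collective variant (a sketch along the same lines, not my exact program) would be:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int x = 1600;  /* same element count as above */
    double* A = calloc(x, sizeof(double));

    /* root's buffer is broadcast to every rank; the full program also
       segfaults when a collective moves a buffer of this size */
    MPI_Bcast(A, x, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(A);
    MPI_Finalize();
    return 0;
}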
I am using srun --mpi=pmix -n 36 my_program in my sbatch job, but the segfault also occurs when using mpirun.
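To narrow down where the failure starts, I varied the buffer size. One simple way to do that with the sample above (an illustrative tweak, not my exact code) is to read x from the command line:

    /* replaces the "int x = 1600;" line above; atoi comes from stdlib.h, which is already included */
    int x = (argc > 1) ? atoi(argv[1]) : 1600;

The element count can then be passed at launch, e.g. srun --mpi=pmix -n 36 my_program 1400, to test different sizes.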
When x is below 1500, there are no errors. When x is above 1500, I get output similar to the following:
[node30:14535] *** Process received signal ***
[node42:144621] *** Process received signal ***
[node42:144621] Signal: Segmentation fault (11)
[node42:144621] Signal code: Address not mapped (1)
[node42:144621] Failing at address: 0x7fcc6fdcc210
[node30:14535] Signal: Segmentation fault (11)
[node30:14535] Signal code: Address not mapped (1)
[node30:14535] Failing at address: 0x7fe9fe17d210
[node19:91882] *** Process received signal ***
[node19:91882] Signal: Segmentation fault (11)
[node19:91882] Signal code: Address not mapped (1)
[node19:91882] Failing at address: 0x7fb02739d210
srun: error: node30: task 4: Segmentation fault
srun: error: node42: task 7: Segmentation fault
I should also note that the program actually completes successfully despite these errors, but it blocks the next run from starting.