Error "syscheck.x: Rank 0:79: MPI_Allgather: ibv_reg_mr() failed: addr 0x7ff62a618010, len 2097152"

Issue

The Platform MPI system check benchmark fails with the following error:
[root@mgmt ~]# mpirun -v -d -hostlist bigsmp01:64,bigsmp02:64,bigsmp03:64 -np 192 -e PCMPI_SYSTEM_CHECK=BM -e MPI_RDMA_MSGSIZE=8192,8192,4193304 /HPC/syscheck.x
debug 1, pretend 0, verbose 1
job 0, check 0, tv=0, mpirun_instr ???
remsh = /usr/bin/ssh
SPMD cmd: -hostlist bigsmp01:64,bigsmp02:64,bigsmp03:64 -np 192 -e PCMPI_SYSTEM_CHECK=BM -e MPI_RDMA_MSGSIZE=8192,8192,4193304 /HPC/syscheck.x
Main socket port 44505
Temporary appfile: /tmp/mpiafhsUWof
Building LocalHost file/block scheduled...
nodeCnt == 0
Parsing application description...
Identifying hosts...
Spawning processes...
Platform-MPI licensed for Platform-MPI Internal System Check.
Process layout for world 0 is as follows:
mpirun:  proc 30445
  daemon proc 86259 on host 192.168.1.29
    rank 0:  proc 86273
    …….
    rank 191:  proc 87530
Output coll binary data file: pmpi810_coll_selection.dat
Total tests to run 22
Test  1 of 22: Complete
Test  2 of 22: Complete
syscheck.x: Rank 0:79: MPI_Allgather: ibv_reg_mr() failed: addr 0x7ff62a618010, len 2097152
syscheck.x: Rank 0:79: MPI_Allgather: Internal MPI error
MPI Application rank 79 exited before MPI_Finalize() with status 16
[root@mgmt ~]#

Solution

This problem is caused not by the Platform MPI libraries but by the low-level driver for Mellanox ConnectX HCAs (mlx4_core) in OFED 1.5.3.
The solution is to increase the mlx4_core option log_num_mtt by adding the following line to /etc/modprobe.conf:
 options mlx4_core log_num_mtt=24
See /usr/share/doc/ofed-docs-1.5.3/release_notes/mlx4_release_notes.txt:
 log_num_mtt: log maximum number of memory translation table segments per HCA (int)
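
To judge whether a given log_num_mtt is large enough, it can help to estimate how much memory the HCA can register with that setting, since ibv_reg_mr() fails once the memory translation table is exhausted. The sketch below uses the commonly cited estimate 2^log_num_mtt * 2^log_mtts_per_seg * page size. Neither that formula nor the second module option, log_mtts_per_seg, appears in this article; both are assumptions here, as are the example values of 3 and a 4096-byte page. The function name max_reg_bytes is made up for illustration.

```shell
# Upper bound on registrable memory, in bytes, for the given mlx4_core settings.
# Arguments: log_num_mtt, log_mtts_per_seg, page size in bytes.
# Assumed formula (not taken from this article):
#   2^log_num_mtt * 2^log_mtts_per_seg * page_size
max_reg_bytes() {
    echo $(( (1 << $1) * (1 << $2) * $3 ))
}

# Example with log_num_mtt=24 from this article. log_mtts_per_seg=3 and a
# 4096-byte page are assumed values; check the real ones on your nodes.
bytes=$(max_reg_bytes 24 3 4096)
echo "log_num_mtt=24 -> $(( bytes / 1024 / 1024 / 1024 )) GiB registrable"
```

With these assumed inputs the estimate is 512 GiB. The result should comfortably exceed the physical memory of each node. The value currently in effect can usually be read from /sys/module/mlx4_core/parameters/log_num_mtt, and a changed setting in /etc/modprobe.conf only applies after the mlx4_core module has been reloaded.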