Description
Hi,
I have a question about the different eager protocol implementations on my InfiniBand system. When running ucx_info, I observe the output shown below:
$ ucx_info -e -u t -D net
# UCP endpoint
#
# peer: <no debug data>
# lane[0]: 6:rc_mlx5/mlx5_0:1.0 md[1] -> md[1]/ib/sysdev[1] seg 4294967295 rma_bw#0 am wireup
#
# +---------------------+----------------------------------------------------------------------+
# | ucx_info self cfg#0 | tagged message by ucp_tag_send*(fast-completion) from host memory |
# +---------------------+---------------------------------------------------+------------------+
# | 0..2038 | eager short | rc_mlx5/mlx5_0:1 |
# | 2039..8246 | eager copy-in copy-out | rc_mlx5/mlx5_0:1 |
# | 8247..16676 | multi-frag eager copy-in copy-out | rc_mlx5/mlx5_0:1 |
# | 16677..262143 | multi-frag eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
# | 256K..inf | (?) rendezvous zero-copy read from remote | rc_mlx5/mlx5_0:1 |
# +---------------------+---------------------------------------------------+------------------+
#
# +---------------------+--------------------------------------------------------------+
# | ucx_info self cfg#0 | tagged message by ucp_tag_send*(multi) from host memory |
# +---------------------+-------------------------------------------+------------------+
# | 0..514 | eager short | rc_mlx5/mlx5_0:1 |
# | 515..4066 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
# | 4067..inf | (?) rendezvous zero-copy read from remote | rc_mlx5/mlx5_0:1 |
# +---------------------+-------------------------------------------+------------------+
#
# +---------------------+--------------------------------------------------------------+
# | ucx_info self cfg#0 | tagged message by ucp_tag_send* from host memory |
# +---------------------+-------------------------------------------+------------------+
# | 0..2038 | eager short | rc_mlx5/mlx5_0:1 |
# | 2039..8246 | eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
# | 8247..35888 | multi-frag eager zero-copy copy-out | rc_mlx5/mlx5_0:1 |
# | 35889..inf | (?) rendezvous zero-copy read from remote | rc_mlx5/mlx5_0:1 |
# +---------------------+-------------------------------------------+------------------+
In the fast-completion section, I see that after eager short, UCX uses eager copy-in/copy-out, which I understand corresponds to the bcopy path. In contrast, the third section shows the protocols and message size limits for the same eager operation but for default-completion. From my understanding, this means UCX generates and returns a request object in this case.
What confuses me is that the default-completion path does not seem to use the bounce-buffer implementation: for 2039..8246 bytes it selects eager zero-copy, whereas the fast-completion path selects eager copy-in copy-out for the same range. I would expect zcopy to require memory registration (pinning) even in the default-completion path, which should add overhead for smaller messages that would normally benefit from a bounce buffer.
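To make the comparison concrete, here is a small Python sketch that encodes the size ranges from the ucx_info output above and reports which protocol each table selects for a given message size. It is only an illustration of the tables, not UCX code: the names `TABLES` and `protocol_for` are hypothetical, the label "default" for the third table is my own (its header carries no qualifier), 256K is assumed to mean 262144 bytes, and the "(?)" marker on the rendezvous rows is omitted.

```python
# Size ranges in bytes (inclusive), copied from the ucx_info output above.
# An upper bound of None stands for "inf". Table labels are assumptions:
# "default" is the third table, whose header has no qualifier.
TABLES = {
    "fast-completion": [
        (0, 2038, "eager short"),
        (2039, 8246, "eager copy-in copy-out"),
        (8247, 16676, "multi-frag eager copy-in copy-out"),
        (16677, 262143, "multi-frag eager zero-copy copy-out"),
        (262144, None, "rendezvous zero-copy read from remote"),
    ],
    "multi": [
        (0, 514, "eager short"),
        (515, 4066, "eager zero-copy copy-out"),
        (4067, None, "rendezvous zero-copy read from remote"),
    ],
    "default": [
        (0, 2038, "eager short"),
        (2039, 8246, "eager zero-copy copy-out"),
        (8247, 35888, "multi-frag eager zero-copy copy-out"),
        (35889, None, "rendezvous zero-copy read from remote"),
    ],
}


def protocol_for(variant, size):
    """Return the protocol name listed for `size` bytes in the given table."""
    for low, high, name in TABLES[variant]:
        if size >= low and (high is None or size <= high):
            return name
    raise ValueError(f"no range covers size {size} in table {variant!r}")


if __name__ == "__main__":
    # Print the selection side by side for a few sample sizes.
    for size in (1024, 4096, 16384, 65536):
        for variant in TABLES:
            print(f"{size:>6} B  {variant:<16} {protocol_for(variant, size)}")
```

For a 4096-byte message, for example, the lookup returns copy-in copy-out for the fast-completion table and zero-copy for the third table, which is exactly the difference I am asking about.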
Could someone please explain the rationale behind this behavior and why zcopy is preferred here instead of bcopy/bounce buffering?
Thanks in advance for any clarification.
Setup and Versions
CPU: Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
$ lspci | grep Mellanox
5e:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
$ ofed_info -s
OFED-internal-25.10-1.2.8
$ ucx_info -v
# Library version: 1.21.0