I have two machines: one with direct internet access and the other requiring NAT to access the internet. I’ve set up a Tailscale network to connect the two machines, and they are able to ping each other and SSH via Tailscale IP addresses with very low latency (~3ms). The inter-machine communication via Tailscale relay works as expected.
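I am trying to run the following command over the Tailscale network:
NCCL_DEBUG=TRACE mpirun --allow-run-as-root -np 2 -H 100.64.0.24:1,100.64.0.27:1 -x NCCL_SOCKET_IFNAME=tailscale0 -x NCCL_IB_DISABLE=1 --mca btl_tcp_if_include tailscale0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1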
However, this command fails without any output when running across the Tailscale network with NAT in between the machines. On the other hand, if both machines are within the same LAN, using Tailscale, the command runs successfully.
Steps to reproduce:
Set up Tailscale on two machines—one with direct internet access and one behind NAT.
Ensure both machines can ping each other and SSH using Tailscale IPs.
Build nccl-tests with MPI support and launch the test:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
NCCL_DEBUG=TRACE mpirun --allow-run-as-root -np 2 -H 100.64.0.24:1,100.64.0.27:1 -x NCCL_SOCKET_IFNAME=tailscale0 -x NCCL_IB_DISABLE=1 --mca btl_tcp_if_include tailscale0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
Run the command above with the specified environment variables and Tailscale network settings.
Observe that the command fails with no output.
Expected behavior:
The command should run successfully when using Tailscale over NAT, as it does when both machines are in the same LAN.
Additional Information:
The issue only occurs when the machines are connected via Tailscale with NAT.
The command works fine when both machines are in the same local network using Tailscale.
Setting NCCL_DEBUG=TRACE does not produce any useful logs when the command fails.
Could anyone suggest how to resolve this issue?
Does MPI work at all in this configuration? Have you tried running any MPI "hello world" type programs using this setup? It could be that all_reduce_perf never even gets around to initializing NCCL if the MPI initialization fails...
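In the same spirit, it may help to rule out the network layer before MPI gets involved. Ping and SSH only exercise ICMP and port 22, while Open MPI's TCP BTL and NCCL's socket transport open TCP connections on other ports. Below is a minimal sketch of a plain TCP round-trip check, not an MPI or NCCL test. The function names `serve_once` and `probe` are made up for illustration, and the port is whatever you choose to try.

```python
import socket
import threading


def serve_once(bind_host: str, port: int, ready: threading.Event) -> None:
    """Accept a single TCP connection and echo back what the client sends."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((bind_host, port))
        srv.listen(1)
        ready.set()  # signal that the listener is up
        conn, _addr = srv.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(data)


def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP round trip to host:port completes within timeout."""
    payload = b"ping"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            return sock.recv(1024) == payload
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, run `serve_once("100.64.0.24", 50000, threading.Event())` on the first machine and `probe("100.64.0.24", 50000)` on the second, then swap roles so both directions are covered. Port 50000 is an arbitrary choice. If the probe fails in either direction, the problem is likely below MPI and NCCL. If it passes both ways, an MPI "hello world" is the next thing to try.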