NCCL fails to run with Tailscale over NAT, works in LAN #1606

Open
helloburke opened this issue Feb 16, 2025 · 1 comment

@helloburke

I have two machines: one with direct internet access and the other requiring NAT to access the internet. I’ve set up a Tailscale network to connect the two machines, and they are able to ping each other and SSH via Tailscale IP addresses with very low latency (~3ms). The inter-machine communication via Tailscale relay works as expected.

I am trying to run the following command over the Tailscale network:

NCCL_DEBUG=TRACE mpirun --allow-run-as-root -np 2 -H 100.64.0.24:1,100.64.0.27:1 -x NCCL_SOCKET_IFNAME=tailscale0 -x NCCL_IB_DISABLE=1 --mca btl_tcp_if_include tailscale0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

However, this command fails without any output when running across the Tailscale network with NAT in between the machines. On the other hand, if both machines are within the same LAN, using Tailscale, the command runs successfully.
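Ping and SSH only exercise ICMP and port 22, so one thing worth ruling out is whether other TCP ports are reachable in both directions across the NAT path. MPI's TCP transport and NCCL's socket transport generally connect on dynamically chosen ports. Below is a minimal sketch of such a check in Python. The helper names `listen_once` and `can_connect` and the port 29500 are made up for illustration and are not part of NCCL, Open MPI, or Tailscale.

```python
import socket


def listen_once(bind_ip, port, timeout=30.0):
    """Accept a single TCP connection on bind_ip:port.

    Returns the peer's IP address, or None if nobody connected in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((bind_ip, port))
        srv.listen(1)
        srv.settimeout(timeout)
        try:
            conn, peer = srv.accept()
        except socket.timeout:
            return None
        conn.close()
        return peer[0]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, run `listen_once("100.64.0.24", 29500)` on the first machine and `can_connect("100.64.0.24", 29500)` on the second, then swap roles with `100.64.0.27`. If one direction fails while ping and SSH still work, that would point at port filtering or the NAT path rather than at NCCL itself.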

Steps to reproduce:

1. Set up Tailscale on two machines, one with direct internet access and one behind NAT.
2. Ensure both machines can ping each other and SSH using Tailscale IPs.
3. Build nccl-tests with MPI support:
   git clone https://github.com/NVIDIA/nccl-tests.git
   cd nccl-tests
   make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
4. Run the test with the specified environment variables and Tailscale network settings:
   NCCL_DEBUG=TRACE mpirun --allow-run-as-root -np 2 -H 100.64.0.24:1,100.64.0.27:1 -x NCCL_SOCKET_IFNAME=tailscale0 -x NCCL_IB_DISABLE=1 --mca btl_tcp_if_include tailscale0 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
5. Observe that the command fails with no output.

Expected behavior:

The command should run successfully when using Tailscale over NAT, as it does when both machines are in the same LAN.

Additional Information:

The issue only occurs when the machines are connected via Tailscale with NAT.
The command works fine when both machines are in the same local network using Tailscale.
Setting NCCL_DEBUG=TRACE does not produce any useful logs when the command fails.

Could anyone suggest how to resolve this issue?

@kiskra-nvidia
Member

Does MPI work at all in this configuration? Have you tried running any MPI "hello world" type programs using this setup? It could be that all_reduce_perf never even gets around to initializing NCCL if the MPI initialization fails...
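One quick way to try this before writing an actual MPI program is to launch a plain `hostname` through the same `mpirun` options. This only checks that `mpirun` can start processes on both hosts with the `tailscale0` setting. It does not call `MPI_Init`, so a real MPI "hello world" is still needed to test MPI communication itself. A rough sketch in Python, where the helper names `build_launch_check` and `run_launch_check` are hypothetical and the flags are copied from the command in the report:

```python
import subprocess


def build_launch_check(hosts, ifname, program=("hostname",)):
    """Build an mpirun command that starts one process per host.

    Uses the same host list format and interface option as the original
    all_reduce_perf command, but runs a trivial program instead.
    """
    host_arg = ",".join(f"{h}:1" for h in hosts)
    return [
        "mpirun", "--allow-run-as-root",
        "-np", str(len(hosts)),
        "-H", host_arg,
        "--mca", "btl_tcp_if_include", ifname,
        *program,
    ]


def run_launch_check(cmd, timeout=60.0):
    """Run cmd and return (exit_code, stdout, stderr).

    exit_code is None if the binary is missing or the run timed out.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return None, "", "command not found"
    except subprocess.TimeoutExpired:
        return None, "", "timed out"
    return proc.returncode, proc.stdout, proc.stderr
```

Calling `run_launch_check(build_launch_check(["100.64.0.24", "100.64.0.27"], "tailscale0"))` should print both hostnames if the launch works. A hang or timeout here would suggest the problem is in the MPI launch rather than in NCCL.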
