dce-umip-nemo.cc does not return #2
I instrumented dce-umip-nemo.cc to output full logging for all DCE classes. Using tcpdump, I found that on node 5 (the mobile node that is running the ping process), the last successful ping is at time 273:
$ tcpdump -r dce-umip-nemo-5-0.pcap -nn -tt
In the above, the 'Acknowledgment' packets are Wi-Fi acks (which we can ignore). Between time 273 and time 279, the mobile node's position is such that it only sends 'echo request' packets but doesn't receive a reply again until time 280. At time 280, we can observe that the ICMPv6 echo reply is received, and this induces a loop so that the simulator doesn't move forward from this time (e.g. the Acknowledgment packet is never sent for this echo reply).
Now, if we look at the log output for these two cases (time 273 and time 280), we see:
and then at time 280:
Right after dce_gettimeofday(), there is a 'dce_sched_yield' call. This is the first time the call appears in the logging trace, and it then recurs for the rest of the trace (until I exited the program with Control-C). By breaking in gdb at this point, I traced the call back to this code in iputils/ping_common.c (line 580).
In this case, 'next' is 1, and in_flight() is zero, so the code enters the branch that calls sched_yield(), which seems to send the program into a spin loop with nothing to move (simulation) time forward. Setting |
Patch to fix: direct-code-execution/ns-3-dce#124
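To make the branch described above easier to follow, here is a heavily simplified model of the decision. It is not the iputils source: the function name, the enum, and the `shortTimeout` threshold are hypothetical, and the two branches other than the yield case are assumptions about how such a loop is typically structured. The only part taken from the observation above is that `next` equal to 1 with `in_flight()` equal to zero ends up in the branch that calls sched_yield().

```cpp
// Hypothetical, simplified model of the wait decision; NOT the iputils code.
// `next` and `shortTimeout` are assumed to be in the same time unit.
enum class WaitAction {
  BlockWithTimeout,  // assumed: timeout is long enough to block on receive
  SleepMinInterval,  // assumed: replies are outstanding, so sleep briefly
  SpinWithYield      // observed: nothing in flight, so spin via sched_yield()
};

WaitAction ChooseWaitAction(int next, int inFlight, int shortTimeout) {
  if (next > shortTimeout) {
    return WaitAction::BlockWithTimeout;
  }
  if (inFlight > 0) {
    return WaitAction::SleepMinInterval;
  }
  // Very short timeout and no reply expected: the caller yields and retries.
  return WaitAction::SpinWithYield;
}
```

With `next` = 1 and `inFlight` = 0 (the values seen in gdb), this model picks the yield branch. That is harmless on a real kernel, but it is a problem if nothing else advances the clock.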
In preparing for a possible DCE 1.11 release on Ubuntu 16.04, I encountered the following problem with a umip example. The version 1.11 targets are presently pulling the tip of the development repositories.
and the simulation does not return.
The problem is isolated to the dce-umip-nemo.cc example. The simulation enters an infinite loop at time 280.025306697s and does not reach the scheduled stop time of 300s. The following command yields the following output:
(repeats endlessly)
It seems from this logging that the same task is dequeued from the queue and rescheduled, over and over.
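The dequeue-and-reschedule pattern can be reproduced with a toy model. The sketch below is not DCE code: every name in it is made up for illustration, and it assumes a cooperative scheduler in which simulated time advances only when a task sleeps or blocks. Under that assumption, a task that only yields while waiting for a deadline is dequeued and re-enqueued forever at the same timestamp.

```cpp
#include <cstdint>
#include <deque>
#include <functional>

// Toy model only: none of these names come from DCE or ns-3.
enum class Request { Yield, Sleep, Exit };

struct Step {
  Request request;
  std::uint64_t sleepNs;  // used only when request == Request::Sleep
};

struct RunResult {
  std::uint64_t finalTimeNs;
  std::uint64_t dequeues;
  bool finished;
};

using Task = std::function<Step(std::uint64_t nowNs)>;

// Runs one task until it exits or the dequeue budget is exhausted.
RunResult RunSingleTask(const Task& task, std::uint64_t maxDequeues) {
  std::uint64_t nowNs = 0;
  std::uint64_t dequeues = 0;
  std::deque<int> active{0};  // one runnable task, id 0
  while (!active.empty() && dequeues < maxDequeues) {
    active.pop_front();  // queue size goes 1 -> 0
    ++dequeues;
    const Step step = task(nowNs);
    switch (step.request) {
      case Request::Yield:
        active.push_back(0);  // size 0 -> 1, virtual clock untouched
        break;
      case Request::Sleep:
        nowNs += step.sleepNs;  // sleeping is what advances virtual time here
        active.push_back(0);
        break;
      case Request::Exit:
        return {nowNs, dequeues, true};
    }
  }
  return {nowNs, dequeues, false};
}

constexpr std::uint64_t kDeadlineNs = 1000000;

// Waits for the deadline by yielding: never finishes in this model.
Step SpinUntilDeadline(std::uint64_t nowNs) {
  if (nowNs >= kDeadlineNs) return {Request::Exit, 0};
  return {Request::Yield, 0};
}

// Waits for the deadline by sleeping: the clock advances and the task exits.
Step SleepUntilDeadline(std::uint64_t nowNs) {
  if (nowNs >= kDeadlineNs) return {Request::Exit, 0};
  return {Request::Sleep, 100000};
}
```

Calling `RunSingleTask(SpinUntilDeadline, 1000)` uses up all 1000 dequeues with the clock still at 0, while `RunSingleTask(SleepUntilDeadline, 1000)` finishes once the clock reaches the deadline. The budget cap exists only so the toy terminates. The real trace has no such cap, which matches the endless repetition in the log.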
The ns-3 simulation event that occurs at this time is WifiNetDevice::ForwardUp(). This simulation sends some pings over Wi-Fi, and receiving one of these packets and forwarding it up to the Linux stack puts the simulation into this loop.
If I change the fiber manager (Pthread or Ucontext) it makes no difference.
However, if I change the ns-3 random number stream (i.e., the random seed), the behavior disappears. For instance, this completes successfully for me:
By changing the run number, the random variable streams are different, and this affects the mobility (random walk) and probably also some Wi-Fi contention resolution, such that the problematic event is not triggered. However, it may be the case that if we let the simulation run for a longer time, it would get similarly stuck at a later time. I also experienced some successful runs with RngRun=1, so the problem does not seem deterministic.
Parth debugged this a bit and provided the following information (I am quoting below):
Based on what I could understand from the log file gdb.txt, it's a polling operation which is being enqueued every time.
As far as I understand the dce implementation, the function void LinuxSocketImpl::Poll () should be invoked with the returned event that occurred. But it seems like it's never being called. Now, I'm not very sure if the linux socket factory is being used in this case, but I'm just guessing based on what I currently understand.
Also, I tried to output the m_active list's size every time the scheduler enqueues or dequeues something, and I could see that initially the queue size would go up to 12 or so, but later on it would remain to max 2, but majority of the times it would move back and forth from 1 (dequeue-next) to 0 (enqueue). This is probably because we keep adding the polling job to the queue and keep dequeuing it again and again.
With the condition of RngRun=2 it does work without a gdb attached, but runs infinitely under the gdb, similar to how RngRun=1 works without the gdb. It's actually pretty tough to determine if the current gdb execution could be trusted.