Incorrect connection of data points can happen when logging out of order with implicit step value #3278

Open
cdalinghaus opened this issue Jan 3, 2025 · 0 comments
Labels
help wanted Extra attention is needed type / bug Issue type: something isn't working

Comments

@cdalinghaus
🐛 Bug

Summary: When metrics are logged with only the epoch parameter, the step value is assigned incrementally. If the calls arrive out of order (for example, asynchronous evaluation on a batch system), an epoch/value graph connects the data points incorrectly, because step, not epoch, determines the order in which the points are connected.

image

To reproduce

Pseudocode (the loss values are placeholders; Run() with default arguments is assumed):

from aim import Run

run = Run()
train_loss, eval_loss = 0.5, 0.6  # placeholder values, for illustration only

run.track(float(train_loss), name='train_loss', epoch=1)
run.track(float(eval_loss), name='eval_loss', epoch=1)
run.track(float(train_loss), name='train_loss', epoch=2)
run.track(float(train_loss), name='train_loss', epoch=3)
run.track(float(train_loss), name='train_loss', epoch=4)
run.track(float(eval_loss), name='eval_loss', epoch=3)  # Out of order due to scheduling
run.track(float(eval_loss), name='eval_loss', epoch=2)  # Out of order due to scheduling

In my specific setting, eval_loss is calculated by a separate process and saved to disk. Periodically, the main process (which also runs the training and the Aim logger) picks up the value from disk and logs it.
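
To make the mechanism concrete, here is a minimal sketch that models the behavior described in the summary. ImplicitStepLog is a hypothetical stand-in, not Aim's implementation; it assumes the implicit step is a per-metric counter that increases by one on every call made without an explicit step.

```python
from collections import defaultdict


class ImplicitStepLog:
    """Hypothetical tracker that assigns an implicit, per-metric step counter.

    This only models the behavior described in the issue; it is not Aim's code.
    """

    def __init__(self):
        self._next_step = defaultdict(int)
        self.points = defaultdict(list)  # name -> [(step, epoch, value)]

    def track(self, value, name, epoch):
        # No step is passed, so the next counter value is used.
        step = self._next_step[name]
        self._next_step[name] += 1
        self.points[name].append((step, epoch, value))


log = ImplicitStepLog()
for epoch in (1, 3, 2):  # arrival order of eval_loss in the reproduction
    log.track(0.0, name="eval_loss", epoch=epoch)  # 0.0 is a placeholder value

# Connecting the points in step order visits the epochs out of order.
epochs_in_step_order = [epoch for _, epoch, _ in sorted(log.points["eval_loss"])]
print(epochs_in_step_order)
```

Under this assumption the line for eval_loss is drawn from epoch 1 to epoch 3 and then back to epoch 2, which would produce the kind of backtracking segment described in the summary.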

Expected behavior

The order in which data points are connected should be determined by whatever quantity is selected for the x-axis.
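
As a rough illustration of the expected behavior, the sketch below orders points by the field chosen for the x-axis before connecting them. The connection_order helper and the point values are hypothetical and only illustrate the idea; they are not part of Aim's UI code.

```python
def connection_order(points, x_axis):
    """Return the points sorted by the field selected for the x-axis."""
    return sorted(points, key=lambda point: point[x_axis])


# Illustrative points: steps were assigned in arrival order, epochs arrived as 1, 3, 2.
points = [
    {"step": 0, "epoch": 1, "value": 0.9},
    {"step": 1, "epoch": 3, "value": 0.5},
    {"step": 2, "epoch": 2, "value": 0.7},
]

by_epoch = connection_order(points, "epoch")  # expected when epoch is on the x-axis
by_step = connection_order(points, "step")    # ordering the issue reports
```

With epoch on the x-axis, by_epoch connects epochs 1, 2, 3 in sequence, while by_step keeps the arrival order 1, 3, 2.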

Environment

  • Aim Version: v3.27.0
  • Python version: latest
  • pip version: latest
  • OS: Linux

Additional context

Workaround: always compute and log an explicit step value alongside the epoch:

run.track(float(eval_loss), name='eval_loss', epoch=epoch, step=len(train_dataloader) * epoch)
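
The workaround holds because the explicit step increases with the epoch, so ordering by step and ordering by epoch agree regardless of arrival order. A small sketch of that reasoning follows; explicit_step and steps_per_epoch are hypothetical names, with steps_per_epoch standing in for len(train_dataloader).

```python
def explicit_step(epoch, steps_per_epoch):
    """Global step at the end of `epoch`, mirroring len(train_dataloader) * epoch."""
    return steps_per_epoch * epoch


steps_per_epoch = 100  # assumed stand-in for len(train_dataloader)
arrival_order = [1, 3, 2]  # epochs in the order their eval_loss values arrive

# Map each epoch to the explicit step that would be passed to run.track(..., step=...).
steps = {epoch: explicit_step(epoch, steps_per_epoch) for epoch in arrival_order}

# Sorting epochs by their explicit step recovers the epoch order.
epochs_by_step = sorted(steps, key=steps.get)
```

This assumes a positive, constant number of batches per epoch; if that number varies between epochs, the step would need to be accumulated rather than multiplied.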
