Incorrect connection of data points can happen when logging out of order with implicit step value #3278

cdalinghaus · 2025-01-03T10:44:16Z

🐛 Bug

Summary: When metrics are logged with only the epoch parameter, the step value is assigned implicitly and increases with each call. When the calls happen out of order (for example, asynchronous evaluation on a batch system), an epoch/value graph connects the points incorrectly, because step, not epoch, determines the order in which data points are connected.
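
To make the mechanism concrete, here is a small stand-alone sketch that does not use Aim at all. It assumes the implicit step is a per-metric counter starting at 0 (the starting value is my assumption, and `track`, `points` and `_counters` are hypothetical stand-ins, not Aim internals). It only shows how ordering by step produces a zigzag on an epoch axis.

```python
from collections import defaultdict
from itertools import count

# Hypothetical stand-in for the tracker: each metric name gets its own
# counter, and a call without an explicit step takes the next value.
_counters = defaultdict(count)
points = defaultdict(list)


def track(value, name, epoch, step=None):
    if step is None:
        step = next(_counters[name])  # implicit, incremental step
    points[name].append({"value": value, "epoch": epoch, "step": step})


# eval_loss arrives for epochs 1, 3, 2 (the last two out of order).
for epoch, value in [(1, 0.9), (3, 0.5), (2, 0.7)]:
    track(value, name="eval_loss", epoch=epoch)

# Connecting points in step order visits the epochs as 1 -> 3 -> 2.
by_step = sorted(points["eval_loss"], key=lambda p: p["step"])
connection_order = [p["epoch"] for p in by_step]
print(connection_order)  # [1, 3, 2]
```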

To reproduce

Minimal example (loss values are placeholders):

from aim import Run

run = Run()
train_loss, eval_loss = 0.5, 0.6  # placeholder values

run.track(float(train_loss), name='train_loss', epoch=1)
run.track(float(eval_loss), name='eval_loss', epoch=1)
run.track(float(train_loss), name='train_loss', epoch=2)
run.track(float(train_loss), name='train_loss', epoch=3)
run.track(float(train_loss), name='train_loss', epoch=4)
run.track(float(eval_loss), name='eval_loss', epoch=3)  # Out of order due to scheduling
run.track(float(eval_loss), name='eval_loss', epoch=2)  # Out of order due to scheduling

In my specific setting, eval_loss is calculated by a separate process and saved to disk. Periodically, the main process (which also runs the training and the Aim logger) picks up the value from disk and logs it.

Expected behavior

Data points in the graph should be connected in the order given by whatever quantity is selected for the x-axis.
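
As a rough illustration of what I mean (plain Python, not Aim's UI code; `connection_order` and the sample points are hypothetical), the line would be drawn after sorting by the selected axis rather than always by step:

```python
def connection_order(points, x_axis):
    """Return the points in the order a line should visit them for x_axis."""
    return sorted(points, key=lambda p: p[x_axis])


# Made-up eval_loss points: steps were assigned in arrival order,
# so epochs 3 and 2 are swapped relative to their steps.
eval_points = [
    {"step": 0, "epoch": 1, "value": 0.9},
    {"step": 1, "epoch": 3, "value": 0.5},
    {"step": 2, "epoch": 2, "value": 0.7},
]

# Expected: with epoch on the x-axis, connect 1 -> 2 -> 3.
epoch_line = [p["epoch"] for p in connection_order(eval_points, "epoch")]
# Current behavior as described in this issue: step order, 1 -> 3 -> 2.
step_line = [p["epoch"] for p in connection_order(eval_points, "step")]
```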

Environment

Aim Version: v3.27.0
Python version: latest
pip version: latest
OS: Linux

Additional context

Workaround: always compute and log an explicit step value as well:

run.track(float(eval_loss), name='eval_loss', epoch=epoch, step=len(train_dataloader) * epoch)
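
The same workaround wrapped in a small helper, as a sketch. `step_for_epoch` and `track_eval_loss` are hypothetical names of mine; `run` is assumed to be an Aim `Run` (or anything with a compatible `track` method), and `steps_per_epoch` is assumed to equal `len(train_dataloader)`.

```python
def step_for_epoch(epoch, steps_per_epoch):
    """Derive a step from the epoch so that step order matches epoch order."""
    return steps_per_epoch * epoch


def track_eval_loss(run, eval_loss, epoch, steps_per_epoch):
    """Log eval_loss with an explicit step instead of the implicit counter."""
    run.track(
        float(eval_loss),
        name="eval_loss",
        epoch=epoch,
        step=step_for_epoch(epoch, steps_per_epoch),
    )
```

Usage would be `track_eval_loss(run, eval_loss, epoch, len(train_dataloader))`. Because the step depends only on the epoch, the order in which results are picked up from disk no longer matters.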

cdalinghaus added the "help wanted" and "type / bug" labels on Jan 3, 2025

mihran113 mentioned this issue Jan 20, 2025

[fix] Resolve issues on data points connection on epoch alignment #3283

Merged
