Add tico model #1398

Merged
merged 12 commits into from
Jan 11, 2024

Conversation

IgorSusmelj
Contributor

@IgorSusmelj IgorSusmelj commented Sep 16, 2023

Changes

  • Add TiCo model code for ImageNet benchmark
  • Run experiments and add results to readme and benchmarks tab in the docs

Note that I'll have to rebase this PR on master, as quite some time has passed since I started it.

@codecov

codecov bot commented Sep 16, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (67dc269) 85.49% compared to head (d97102b) 85.50%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1398   +/-   ##
=======================================
  Coverage   85.49%   85.50%           
=======================================
  Files         135      135           
  Lines        5655     5657    +2     
=======================================
+ Hits         4835     4837    +2     
  Misses        820      820           


@IgorSusmelj
Contributor Author

This will be a tougher one. I ran several experiments, and in all of them the loss goes up while the accuracy stays low. The training seems unstable. I varied some of the hyperparameters but couldn't get it to train.

@IgorSusmelj
Contributor Author

IgorSusmelj commented Jan 5, 2024

The training on ImageNet does not work as expected. We made several modifications to the loss, but no matter what we do, the loss saturates at its maximum value and the accuracy stays at about 0%.

Epoch 3:  63%|██████▎   | 1583/2502 [14:05<08:10,  1.87it/s, v_num=0, train_loss=8.000, val_online_cls_loss=140.0, val_online_cls_top1=0.00106, val_online_cls_top5=0.00414]
...
Epoch 5:  10%|▉         | 238/2502 [02:11<20:47,  1.81it/s, v_num=0, train_loss=8.000, val_online_cls_loss=569.0, val_online_cls_top1=0.00094, val_online_cls_top5=0.00438]

Things we tried:

  • Detaching the auxiliary matrix to prevent backprop through it (suggestion from Guarin):
    B = torch.mm(z_a.T, z_a) / z_a.shape[0]
  • Toggling gather_distributed (suggestion from Guarin): we get the same results with and without it
  • With the new default settings (this PR) the loss saturates at 8.0
  • Variations of the hyperparameters (beta, rho of the loss and learning rate)
    [image]

I'll do one more run with more in-depth logs to better isolate which parts of the loss go out of control :)
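For reference, here is a minimal sketch of what logging the individual loss terms could look like. It is not the code from this PR: the function name `tico_loss_terms` and the `detach_b` flag are hypothetical, and the loss form (an invariance term plus a `rho`-weighted contrast term over an exponential moving average `C` of `B`) is an assumption based on the `B` line above and the TiCo paper. Under this formulation, fully collapsed embeddings push the contrast term toward `rho`, so a loss pinned at 8.0 would be consistent with collapse if `rho` were 8.0 (the value used in the runs is not stated in this thread).

```python
import torch
import torch.nn.functional as F


def tico_loss_terms(z_a, z_b, C_prev, beta=0.9, rho=8.0, detach_b=True):
    """Hypothetical helper: returns (total, invariance, contrast, updated C)."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)

    # Auxiliary matrix B: second-moment estimate of the embeddings in this batch.
    B = torch.mm(z_a.T, z_a) / z_a.shape[0]
    if detach_b:
        # Assumption: "detaching" means blocking gradients through B.
        B = B.detach()

    # Exponential moving average of B, used for the contrast term.
    C = beta * C_prev + (1 - beta) * B

    # Invariance term: 1 - mean cosine similarity between the two views.
    invariance = 1 - (z_a * z_b).sum(dim=1).mean()
    # Contrast term: penalizes embeddings aligned with dominant directions of C.
    contrast = rho * (torch.mm(z_a, C) * z_a).sum(dim=1).mean()

    return invariance + contrast, invariance, contrast, C.detach()


# Toy check with fully collapsed embeddings (every sample is the same vector).
torch.manual_seed(0)
dim = 16
z = torch.randn(1, dim).repeat(32, 1)
C = torch.zeros(dim, dim)
for _ in range(200):
    total, inv, con, C = tico_loss_terms(z, z, C)
print(f"total={total.item():.3f} invariance={inv.item():.3f} contrast={con.item():.3f}")
```

In a training loop the two terms could be logged separately (for example as `train_invariance_loss` and `train_contrast_loss`, both made-up names) to see which one dominates when the total loss saturates.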

@guarin
Contributor

guarin commented Jan 5, 2024

@IgorSusmelj could you quickly summarize the changes you tried?

@IgorSusmelj
Contributor Author

The latest changes seem promising:
[image]

@IgorSusmelj IgorSusmelj marked this pull request as ready for review January 10, 2024 13:21
Contributor

@guarin guarin left a comment

LGTM! Let's update the benchmarks once we figure out what is wrong with the model.

I'll also create follow-up issues to update BYOL and MoCo to use the correct backbone for evaluation.

| SimCLR\* + DCL | Res50 | 256 | 100 | 65.1 | 73.5 | 49.6 | [link](https://tensorboard.dev/experiment/k4ZonZ77QzmBkc0lXswQlg/) | [link](https://lightly-ssl-checkpoints.s3.amazonaws.com/imagenet_resnet50_dcl_2023-07-04_16-51-40/pretrain/version_0/checkpoints/epoch%3D99-step%3D500400.ckpt) |
| SimCLR\* + DCLW | Res50 | 256 | 100 | 64.5 | 73.2 | 48.5 | [link](https://tensorboard.dev/experiment/TrALnpwFQ4OkZV3uvaX7wQ/) | [link](https://lightly-ssl-checkpoints.s3.amazonaws.com/imagenet_resnet50_dclw_2023-07-07_14-57-13/pretrain/version_0/checkpoints/epoch%3D99-step%3D500400.ckpt) |
| SwAV | Res50 | 256 | 100 | 67.2 | 75.4 | 49.5 | [link](https://tensorboard.dev/experiment/Ipx4Oxl5Qkqm5Sl5kWyKKg) | [link](https://lightly-ssl-checkpoints.s3.amazonaws.com/imagenet_resnet50_swav_2023-05-25_08-29-14/pretrain/version_0/checkpoints/epoch%3D99-step%3D500400.ckpt) |
| TiCo | Res50 | 256 | 100 | 49.7 | 72.7 | 26.6 | - | [link](https://lightly-ssl-checkpoints.s3.amazonaws.com/imagenet_resnet50_tico_2024-01-07_18-40-57/pretrain/version_0/checkpoints/epoch%3D99-step%3D250200.ckpt) |
Contributor

Interesting that the linear accuracy is so low. In the paper they report linear accuracy that is similar to the other methods.

Contributor Author

100 epochs vs 1000 epochs could be the reason for that

@IgorSusmelj IgorSusmelj force-pushed the igor-lig-3068-add-tico-imagenet-benchmark branch from a9c4979 to d97102b Compare January 11, 2024 14:27
@IgorSusmelj IgorSusmelj merged commit deb3c31 into master Jan 11, 2024
8 checks passed
@IgorSusmelj IgorSusmelj deleted the igor-lig-3068-add-tico-imagenet-benchmark branch January 11, 2024 14:41
guarin pushed a commit that referenced this pull request Jan 19, 2024
* Add tico model

* Fix view

* Fix wrong hyperparam

* Fix hyperparam and make naming consistent

* Fix wrong loss

* Minor changes for debugging

* Cleanup

* Log individual losses. Detach B.

* Fix issues in code. Remove debugging logs.

* Add TiCo benchmarks results and checkpoints

* Update codebase and use naming from paper.

* Remove misleading comment about parameters for lr.
@guarin guarin mentioned this pull request Aug 16, 2024