Continuous evaluations init commit (facebookresearch#325)
Summary:
Create a script that continuously evaluates benchmarks as checkpoints become available from a pretraining run.

<img width="593" alt="Screen Shot 2021-06-02 at 10 22 37 AM" src="https://user-images.githubusercontent.com/25669348/120497511-7888c880-c38c-11eb-8bc1-78bacc5d968b.png">
<img width="1237" alt="Screen Shot 2021-06-02 at 10 22 59 AM" src="https://user-images.githubusercontent.com/25669348/120497575-85a5b780-c38c-11eb-9445-2076e15be888.png">

Next Steps:
1. Deal with sharded checkpoints and their conversion.
1. Improve the max_iteration logic.
1. Extend to FB infra.
1. Write unit tests.
1. Think about how to handle these tricky evaluation tests: facebookresearch#325 (comment)
1. Try not to replicate so much logic in the class (e.g. get path names from vissl code, which requires some refactoring).
1. Look into email notifications.

Testing:
1. Run 8-node SwAV for 10 epochs with 3 different benchmark evaluations with different resource requirements. SUCCESS.
json config:
```
{
  "params": {
    "training_checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints",
    "benchmarks": [
      {
        "evaluation_name": "clevr_count_linear",
        "config_files": [
          "config=config_local/eval_resnet_8gpu_transfer_clevr_count_linear_benchmark_suite_scheduler_test.yaml"
        ]
      },
      {
        "evaluation_name": "clevr_dist_linear",
        "config_files": [
          "config=config_local/eval_resnet_8gpu_transfer_clevr_dist_linear_benchmark_suite_scheduler_test.yaml"
        ]
      },
      {
        "evaluation_name": "in1k_linear",
        "config_files": [
          "config=config_local/eval_resnet_8gpu_transfer_in1k_linear_benchmark_suite_scheduler_test.yaml"
        ]
      }
    ],
    "evaluation_iter_freq": 600,
    "evaluation_phase_freq": 2,
    "evaluate_final_phase": true,
    "autoload_slurm_evaluator_checkpoint": false,
    "slurm_evaluator_checkpoint": null,
    "auto_retry_evaluations": true,
    "retry_evaluation_job_ids": [],
    "max_retries": 3,
    "pytorch_ports": [40050, 40051, 40052, 40053, 40054, 40055, 40056, 40057, 40058, 40059, 40060, 40061, 40062, 40063]
  },
  "slurm_options": {
    "PARTITION": "learnfair"
  }
}
```

Example snippet from `evaluation_metrics.json`:
```
{
  "model_final_checkpoint_phase9": [
    {
      "checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints",
      "config_files": [
        "config=config_local/eval_resnet_8gpu_transfer_clevr_count_linear_benchmark_suite_scheduler_test.yaml",
        "hydra.run.dir='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
        "config.CHECKPOINT.DIR='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints'",
        "config.SLURM.LOG_FOLDER='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
        "config.SLURM.LOG_FOLDER='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
        "config.SLURM.USE_SLURM=true",
        "config.MODEL.WEIGHTS_INIT.PARAMS_FILE='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/model_final_checkpoint_phase9.torch'"
      ],
      "evaluation_name": "clevr_count_linear",
      "job_id": "42410489",
      "metrics": {
        "test_accuracy_list_meter_top_1_res5": {
          "iteration": 822,
          "metric": 34.62,
          "train_phase_idx": 2
        },
        "train_accuracy_list_meter_top_1_res5": {
          "iteration": 822,
          "metric": 33.8514,
          "train_phase_idx": 2
        }
      },
      "num_retries": 1,
      "slurm_checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints",
      "slurm_log_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear",
      "slurm_state": "COMPLETED",
      "weights_init_params_file": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/model_final_checkpoint_phase9.torch"
    },
    ...
```

The following hold:
1. Training completes appropriately, w/o errors.
1. Able to resume checkpoints.
1. Evaluation folder structure is as expected above.
1. Best Metrics are extracted.

Pull Request resolved: facebookresearch#325

Reviewed By: prigoyal

Differential Revision: D28901750

Pulled By: iseessel

fbshipit-source-id: 732074043200ac51f3e709d5e67e686f26d36835
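For readers who want to consume the `evaluation_metrics.json` output shown above, the following is a minimal sketch of how its structure could be flattened into one row per metric. It is not part of this commit: the helper name `summarize_metrics` is hypothetical, the inline sample is a trimmed copy of the snippet above, and treating a missing `metrics` key as "no results yet" is an assumption.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical helper, not part of the commit: flattens the
# {checkpoint_name: [evaluation, ...]} layout shown above into rows.
Row = Tuple[str, str, str, float, int]


def summarize_metrics(results: Dict[str, List[Dict[str, Any]]]) -> List[Row]:
    rows: List[Row] = []
    for checkpoint_name, evaluations in results.items():
        for evaluation in evaluations:
            # Assumption: an evaluation without a "metrics" key has no results yet.
            for meter_name, best in evaluation.get("metrics", {}).items():
                rows.append(
                    (
                        checkpoint_name,
                        evaluation["evaluation_name"],
                        meter_name,
                        best["metric"],
                        best["iteration"],
                    )
                )
    return rows


# Trimmed copy of the example snippet above (paths and Slurm fields omitted).
sample = json.loads(
    """
    {
      "model_final_checkpoint_phase9": [
        {
          "evaluation_name": "clevr_count_linear",
          "slurm_state": "COMPLETED",
          "metrics": {
            "test_accuracy_list_meter_top_1_res5": {
              "iteration": 822, "metric": 34.62, "train_phase_idx": 2
            },
            "train_accuracy_list_meter_top_1_res5": {
              "iteration": 822, "metric": 33.8514, "train_phase_idx": 2
            }
          }
        }
      ]
    }
    """
)

rows = summarize_metrics(sample)
for row in rows:
    print(row)
```

To run it against a real file, replace the inline sample with `json.load(open(path))`, where `path` points at the `evaluation_metrics.json` written by the scheduler.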