Problems in reproducing the RL fine-tuned results #30
Comments
@abhik1505040 Thanks for reporting the observations. The RL fine-tuning stage can be quite sensitive to hyperparameters. Based on my experience, you should experiment with a larger batch size, e.g. 256 samples per training step, and with lower learning rates. Another trick is to use a new LM head for the RL training iterations: we can initialize this head as a clone of the original LM head from the fine-tuned checkpoint, following this. This strategy can help stabilize RL fine-tuning for T5 models, but in some cases, e.g. in the GPT-J experiments, I found the benefit not too significant.
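The LM-head-cloning trick above can be sketched as follows. This is a minimal sketch with a plain `nn.Linear` standing in for T5's LM head; the dimensions and variable names are illustrative, not CodeRL's actual code:

```python
import copy

import torch
import torch.nn as nn

# Stand-in for T5's LM head (in CodeT5 it is an nn.Linear projecting
# d_model to the vocabulary size, without a bias term).
d_model, vocab_size = 768, 32100  # illustrative sizes
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Clone the fine-tuned head so RL gradients flow into a separate copy
# while the original checkpoint head stays untouched.
rl_head = copy.deepcopy(lm_head)
assert torch.equal(rl_head.weight, lm_head.weight)

# A mock RL update moves only the clone, not the original.
with torch.no_grad():
    rl_head.weight += 0.01
assert not torch.equal(rl_head.weight, lm_head.weight)
```

The point of the deep copy is that RL updates start from the fine-tuned weights but can drift without corrupting the head used for the CE-trained baseline.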
Yeah, I'm suffering from the same failure cases. I haven't got the numbers for the model fine-tuned with generated code yet, but they should be similar to yours @abhik1505040. In particular, for many files it just generates some repetitive text.
@henryhungle Thank you very much for the pointers. I'll give them a try!
I'm also facing the same issue!
@abhik1505040 Hi, I would like to know the pass@1 result of your model fine-tuned with CE loss for 10 epochs. My pass@1 is much lower than the one in the paper, but my pass@5 is similar both to yours and to the paper's.
Hi @sssszh, apologies for the late response; I observed similarly poor performance for pass@1 as well. The exact score was 0.67.
Hi, @abhik1505040 |
Hi folks, I also get pass@1 of approximately 1% but pass@5 of 2.4% with the CE-loss fine-tuned model. After trying a bunch of temperatures, 0.2 seems to give me the best pass@1, at 1.1%. I wonder whether anyone has any updates on reproducing the CE fine-tuned model? Thanks a lot!! @doviettung96 @abhik1505040 @sssszh
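When comparing pass@1 across temperatures, it helps to compute the unbiased pass@k estimator (introduced in the Codex paper) from a fixed pool of samples rather than re-sampling per k. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass, with k <= n."""
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 20 samples per problem and exactly 1 passing sample,
# pass@1 is about 0.05 while pass@5 is 0.25, so a pass@1 far below
# pass@5 is expected, not necessarily a bug.
p1 = pass_at_k(20, 1, 1)
p5 = pass_at_k(20, 1, 5)
```

Averaging `pass_at_k` over all problems gives the benchmark-level score; a low corpus pass@1 alongside a noticeably higher pass@5 is the normal shape of this estimator.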
Hi, thanks for open-sourcing your amazing work!
I have been trying to reproduce the RL fine-tuned results reported in the paper, but unfortunately I am running into some issues. Here is a brief overview of the steps I followed:

1. Fine-tuned the actor model with CE loss for 10 epochs using `train_actor.sh` and the CodeT5-NTP model. This fine-tuned model gives results similar to the paper's (2.86 pass@5 compared to 2.90 in the paper).
2. With some modifications to `generate.py`, generated 20 candidate samples per problem (following the sample files given in the repo) and greedy baseline codes for the training set with the CE fine-tuned model. The `result` key required for the corresponding `gen_solutions.json` and `baseline_solutions.json` was generated with this snippet.
3. Generated the token-level hidden states/critic scores with the released critic model through `generate_critic_scores.sh`.
4. RL fine-tuning with the default hyperparameters in `train_actor_rl.sh` gives a very degraded result (0.84 pass@5).

I would greatly appreciate any suggestions on hyperparameter choices or other settings that could help me reproduce the RL fine-tuned results accurately.
Many thanks!
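For other readers stuck on the same step: the `result` key mentioned above can be populated by executing each candidate against the problem's unit tests. A minimal sketch with an assumed per-test 1/0 outcome format; CodeRL's actual schema may differ (e.g. it may use sentinel codes for compile vs. runtime errors), and the toy problem here is purely illustrative:

```python
def grade_candidate(code: str, tests: list) -> list:
    """Run one candidate program against (expression, expected) unit tests."""
    namespace = {}
    try:
        # A compile or import-time error fails every test.
        exec(code, namespace)
    except Exception:
        return [0] * len(tests)
    outcomes = []
    for expression, expected in tests:
        try:
            outcomes.append(1 if eval(expression, namespace) == expected else 0)
        except Exception:
            outcomes.append(0)  # runtime error on this test counts as a failure
    return outcomes

# Attach a `result` entry per candidate (toy problem, illustrative keys).
solutions = {"0": {"code": "def add(a, b):\n    return a + b"}}
tests = [("add(1, 2)", 3), ("add(-1, 1)", 0)]
for entry in solutions.values():
    entry["result"] = grade_candidate(entry["code"], tests)
```

In practice you would sandbox the `exec` call (separate process, timeout, restricted builtins) rather than run untrusted generated code in the evaluation process.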