Code of evaluation needed #4
@riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we also included a screenshot of the evaluation performance generated by the script. We did not use the --plug_value, --keep_distinct, or --progress_bar_for_each_datapoint flags for evaluation. Thanks
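For context, a minimal sketch of how such a run can be invoked with the test-suite evaluation script (evaluation.py from taoyds/test-suite-sql-eval, linked in the next comment) is shown below, reporting execution accuracy only and omitting the three flags mentioned above; the file paths are placeholders rather than the authors' exact ones.

```python
# Sketch: invoking the Spider test-suite evaluation script (evaluation.py from
# taoyds/test-suite-sql-eval) for execution accuracy, without --plug_value,
# --keep_distinct, or --progress_bar_for_each_datapoint. Paths are placeholders.
import subprocess

subprocess.run(
    [
        "python", "evaluation.py",
        "--gold", "dev_gold.sql",         # gold queries (query \t db_id per line)
        "--pred", "Predicted.txt",        # one predicted SQL query per line
        "--db", "spider/database",        # directory containing the SQLite databases
        "--table", "spider/tables.json",  # Spider schema file
        "--etype", "exec",                # report execution accuracy
    ],
    check=True,
)
```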
I get the same result with https://github.com/taoyds/test-suite-sql-eval
Hi, using the provided script, I still got the same result, just as @starrysky9959 did.
@starrysky9959 @riddiculous Thank you for your feedback, we will update the paper and adjust the execution accuracy for the development set of Spider.
@MohammadrezaPourreza, I have difficulty reproducing the results given in the paper. Could you please give a more detailed description of how you did each step in the README? Thanks in advance.
@cometyang Hi, thank you so much for your interest in our work. I have uploaded the submission file for the DTS-SQL paper for the BIRD benchmark, which is easy to use: you just need to install the requirements and run this script. Please make sure to change the dataset path by changing these two global variables: BASE_DATASET_DIR = "dev.json"
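For illustration only, a sketch of the kind of path setup being described is below; the thread quotes just one of the two global variables, so the second name is a guess, not the script's actual identifier.

```python
# Illustrative path setup for the BIRD submission script described above.
# BASE_DATASET_DIR is quoted verbatim in the comment; BASE_DATABASES_DIR is a
# hypothetical name for the second path variable, which is not shown in the thread.
BASE_DATASET_DIR = "dev.json"                      # BIRD dev questions file
BASE_DATABASES_DIR = "path/to/bird/dev_databases"  # hypothetical: root of the dev SQLite databases
```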
@MohammadrezaPourreza thanks for providing the evaluation code for connecting the two models. I am currently evaluating on Spider-Syn. In Table 6 it mentions DeepSeek 7B Upper bound 85.5 / 78.1, but I only get 79.8 and 72.5, so I am wondering whether I did something wrong during training. For DeepSeek 7B full fine-tuning I got similar results, 69.1 and 56.1, which are very close to the DeepSeek 7B FTTuning 70.4 / 56.6 (Tab. 6). If I understand the paper correctly, if I use filtered_finuting_dataset.csv for fine-tuning the DeepSeek model and predict against the validation dataset, I should get the upper-bound results on the Spider-Syn dataset, am I right?
Thank you very much, @cometyang, for your interest in our research! I'm curious to know whether you used neftune_noise_alpha, quantization, or perhaps LoRA adapters in your experiments? The findings presented in our paper are based on full fine-tuning without quantization or LoRA adapters. Additionally, it's worth noting that in our analysis, neftune_noise_alpha seemed to detrimentally affect performance.
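To make the distinction concrete, the sketch below shows what a full fine-tuning setup of this kind can look like: no 4-bit/8-bit quantization, no LoRA adapters, and neftune_noise_alpha left unset. It assumes the Hugging Face trl SFTTrainer interface (older keyword-argument style); the model name, dataset path, and hyperparameters are placeholders rather than the authors' configuration.

```python
# Minimal full fine-tuning sketch (no quantization, no LoRA, no NEFTune noise).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed 7B-class base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder path, e.g. the filtered fine-tuning CSV mentioned in the thread.
dataset = load_dataset("csv", data_files="path/to/finetuning_dataset.csv")["train"]

args = TrainingArguments(
    output_dir="dts_sql_full_ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=10,
    # neftune_noise_alpha is deliberately left unset (see the comment above).
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",  # assumes a single "text" column holding the prompts
    max_seq_length=2048,
    # no peft_config, so all model weights are updated (full fine-tuning)
)
trainer.train()
```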
Hi @MohammadrezaPourreza, thanks for your reply. The reason DTS-SQL looks interesting is that it is currently the highest-ranked 7B model on the leaderboard at https://bird-bench.github.io/, so I want to dive into the work, understand the gap between the ideal situation and the trained model, and maybe find ways to improve it. To reproduce the results, I tried to follow the settings in your notebook exactly. Are you suggesting that the code used for the paper is different from the shared notebook? If so, could you please also share the code for full fine-tuning (I can change to fp16 and try other hyper-parameters)? I would appreciate it if you could share the settings needed to reproduce the work, so that I can reduce CO2 emissions and have less frustration. :-) Thanks again for sharing this research; I find it interesting that using two models can yield this performance improvement, it is like an agents framework. I modified the code you shared for BIRD and adapted it to Spider-Syn; compared to the numbers reported in the paper, this is what I obtained below. As you can see, there are noticeable differences, so I wonder where I made a mistake.
Evaluation command:
@cometyang May I ask whether the results in Table 3 of the paper are the same as the results you reproduced?
Dear author,
I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but I found that the result (82.7) is not the same as what you reported in the paper (85.5). I wonder if there is an error on my side. Could you please publish your evaluation code?
My predicted exec accuracy:
easy medium hard extra all
count 248 446 174 166 1034
===================== EXECUTION ACCURACY =====================
execution 0.927 0.901 0.741 0.566 0.827
Thank you!