Code of evaluation needed #4

Open
riddiculous opened this issue Mar 13, 2024 · 10 comments

Comments

@riddiculous

riddiculous commented Mar 13, 2024

Dear author,
I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but the result I get (82.7) does not match the number reported in the paper (85.5). I wonder if there is an error on my side. Could you please publish your evaluation code?

My predicted exec accuracy:
           easy    medium   hard    extra   all
count      248     446      174     166     1034
===================== EXECUTION ACCURACY =====================
execution  0.927   0.901    0.741   0.566   0.827
Thank you!

@MohammadrezaPourreza
Owner

@riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we included a screenshot of the evaluation performance generated by the script. We did not use the --plug_value, --keep_distinct, or --progress_bar_for_each_datapoint flags for the evaluation. Thanks!
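
For anyone reproducing this, the corresponding call to the test-suite script (taoyds/test-suite-sql-eval) would look roughly as follows; the gold file, database folder, and tables.json paths are placeholders for the local Spider dev set:

```
# Paths are placeholders; --etype exec reports execution accuracy only.
python evaluation.py \
  --gold gold.txt \
  --pred results/deepseek_spider_validation_set/Predicted.txt \
  --db spider/database \
  --table spider/tables.json \
  --etype exec
```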

@starrysky9959

> Dear author,
> I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but the result (82.7) does not match the number reported in the paper (85.5). I wonder if there is an error on my side. Could you please publish your evaluation code?
>
> My predicted exec accuracy:
> easy medium hard extra all
> count 248 446 174 166 1034
> ===================== EXECUTION ACCURACY =====================
> execution 0.927 0.901 0.741 0.566 0.827
> Thank you!

[screenshot of the evaluation script output]

I get the same result with https://github.com/taoyds/test-suite-sql-eval

@riddiculous
Author

> @riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we included a screenshot of the evaluation performance generated by the script. We did not use the --plug_value, --keep_distinct, or --progress_bar_for_each_datapoint flags for the evaluation. Thanks!

Hi, using the provided script, I still got the same result, just as @starrysky9959 did.

@MohammadrezaPourreza
Owner

@starrysky9959 @riddiculous Thank you for your feedback; we will update the paper and adjust the execution accuracy for the development set of Spider.

@cometyang

cometyang commented Mar 21, 2024

@MohammadrezaPourreza, I am having difficulty reproducing the results given in the paper. Could you please give a more detailed description of how you did each step in the README? Thanks in advance.

@MohammadrezaPourreza
Owner

@cometyang Hi, thank you so much for your interest in our work. I have uploaded the submission file of the DTS-SQL paper for the BIRD benchmark, which is easy to use: you just need to install the requirements and run this script. Please make sure to change the dataset paths by setting these two global variables:
BASE_DATASET_DIR = "dev.json"
BASE_DABATASES_DIR = "./dev_databases/"
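
In other words, before running, the two globals at the top of that script just need to point at the local copy of the BIRD dev set; the paths below are placeholders:

```python
# Placeholder paths -- adjust to wherever the BIRD dev set was downloaded.
BASE_DATASET_DIR = "/data/bird/dev/dev.json"          # dev questions file
BASE_DABATASES_DIR = "/data/bird/dev/dev_databases/"  # directory containing the SQLite databases
```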

@cometyang

cometyang commented Mar 21, 2024

@MohammadrezaPourreza thanks for providing the evaluation code for connecting the two models. I am currently evaluating on Spider-Syn. Table 6 reports the DeepSeek 7B upper bound as 85.5 / 78.1, but I only get 79.8 and 72.5, so I am wondering whether I did something wrong during training. For DeepSeek 7B full fine-tuning, I got similar results, 69.1 and 56.1, which are very close to the 70.4 / 56.6 reported for full fine-tuning in Table 6. If I understand the paper correctly, if I fine-tune the DeepSeek model on filtered_finuting_dataset.csv and predict against the validation dataset, I should get the upper-bound results on the Spider-Syn dataset, am I right?

@MohammadrezaPourreza
Owner

Thank you very much, @cometyang, for your interest in our research! I'm curious to know if you have used neftune_noise_alpha, quantization, or perhaps employed LoRA adapters in your experiments? The findings presented in our paper are based on full fine-tuning without the use of quantization or LoRA adapters. Additionally, it's worth noting that in our analysis, neftune_noise_alpha seemed to detrimentally affect performance.
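
To make the distinction concrete, a full fine-tuning run without quantization, LoRA, or NEFTune could look roughly like the sketch below. This assumes the Hugging Face trl SFTTrainer; the checkpoint name, dataset path, text column, and hyperparameters are placeholders rather than the authors' exact configuration:

```python
# Minimal sketch of full fine-tuning: no 4/8-bit quantization, no LoRA, NEFTune off.
# Checkpoint, dataset path, column name, and hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # loaded in bf16, not quantized
)

# Placeholder CSV with one prompt+SQL training example per row in a "text" column.
train_dataset = load_dataset("csv", data_files="finetuning_dataset.csv")["train"]

args = TrainingArguments(
    output_dir="./dts-sql-full-ft",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",    # placeholder column name
    max_seq_length=2048,
    peft_config=None,             # full fine-tuning: all weights updated, no LoRA adapters
    neftune_noise_alpha=None,     # NEFTune disabled (reported above to hurt performance)
)
trainer.train()
```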

@cometyang

cometyang commented Mar 24, 2024

Hi @MohammadrezaPourreza, thanks for your reply. The reason the DTS-SQL work looks interesting is that it is currently the highest-ranked 7B model on the leaderboard at https://bird-bench.github.io/, so I want to dive into the work, understand the gap between the ideal situation and the trained model, and perhaps find ways to improve it.

To reproduce the results, I tried to follow the settings in your notebook exactly. Are you suggesting that the code used for the paper is different from the shared notebook? If so, could you please also share the full fine-tuning code? I can switch to fp16 and try other hyper-parameters, but I would appreciate it if you could share the settings needed to reproduce the work, so that I can reduce CO2 emissions and have less frustration. :-)

Thanks again for sharing the research work. I find it interesting that using two models can yield this performance improvement; it is like an agent framework.

I modified the code you shared for BIRD and adapted it to Spider-Syn. Compared to the numbers reported in the paper, the results I obtained are below. As you can see, there are noticeable differences, so I wonder where I made a mistake.

DeepSeek               Paper (Tab. 6)   My experiment   Diff
Full finetuning (EX)   70.4             69.1            -1.3
Full finetuning (EM)   56.6             56.1            -0.5
DTS-SQL (EX)           76.2             70.2            -6.0
DTS-SQL (EM)           68.9             62.0            -6.9
Upper bound (EX)       85.5             79.8            -5.7
Upper bound (EM)       78.1             72.5            -5.6

Evaluation command:
python evaluation.py --gold Gold.txt --pred Pred.txt --db $database_folder$ --etype all --table $dataset$/tables.json

@kanseaveg

@cometyang May I ask whether the results in Table 3 of the paper match the results you reproduced?
