Code of evaluation needed #4

Open
riddiculous opened this issue Mar 13, 2024 · 10 comments

Comments

@riddiculous

riddiculous commented Mar 13, 2024

Dear author,
I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but the result I get (82.7) does not match the number reported in the paper (85.5). I wonder if there is an error on my side. Could you please publish your evaluation code?

My predicted exec accuracy:
           easy    medium   hard    extra   all
count      248     446      174     166     1034
===================== EXECUTION ACCURACY =====================
execution  0.927   0.901    0.741   0.566   0.827
Thank you!

@MohammadrezaPourreza
Owner

@riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we included a screenshot of the evaluation performance generated by the script. We did not use the --plug_value, --keep_distinct, or --progress_bar_for_each_datapoint flags for the evaluation. Thanks!
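
For anyone reproducing this, the corresponding call to the test-suite script (taoyds/test-suite-sql-eval) would look roughly as follows; the gold file, database folder, and tables.json paths are placeholders for the local Spider dev set:

```
# Paths are placeholders; --etype exec reports execution accuracy only.
python evaluation.py \
  --gold gold.txt \
  --pred results/deepseek_spider_validation_set/Predicted.txt \
  --db spider/database \
  --table spider/tables.json \
  --etype exec
```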

@starrysky9959

> Dear author,
> I evaluated your results (results/deepseek_spider_validation_set/Predicted.txt) with my own evaluation code (execution accuracy), but the result (82.7) does not match the number reported in the paper (85.5). I wonder if there is an error on my side. Could you please publish your evaluation code?
>
> My predicted exec accuracy:
> easy medium hard extra all
> count 248 446 174 166 1034
> ===================== EXECUTION ACCURACY =====================
> execution 0.927 0.901 0.741 0.566 0.827
> Thank you!

[screenshot of the evaluation script output]

I get the same result with https://github.com/taoyds/test-suite-sql-eval

@riddiculous
Author

> @riddiculous Hi, thank you so much for your interest in this work. For the evaluation we used the official Spider evaluation script from here. In addition, we included a screenshot of the evaluation performance generated by the script. We did not use the --plug_value, --keep_distinct, or --progress_bar_for_each_datapoint flags for the evaluation. Thanks!

Hi, using the provided script, I still got the same result, just as @starrysky9959 did.

@MohammadrezaPourreza
Owner

@starrysky9959 @riddiculous Thank you for your feedback; we will update the paper and adjust the execution accuracy for the development set of Spider.

@cometyang

cometyang commented Mar 21, 2024

@MohammadrezaPourreza, I am having difficulty reproducing the results given in the paper. Could you please give a more detailed description of how you did each step in the README? Thanks in advance.

@MohammadrezaPourreza
Owner

@cometyang Hi, thank you so much for your interest in our work. I have uploaded the submission file of the DTS-SQL paper for the BIRD benchmark, which is easy to use: you just need to install the requirements and run this script. Please make sure to change the dataset paths by setting these two global variables:
BASE_DATASET_DIR = "dev.json"
BASE_DABATASES_DIR = "./dev_databases/"
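
In other words, before running, the two globals at the top of that script just need to point at the local copy of the BIRD dev set; the paths below are placeholders:

```python
# Placeholder paths -- adjust to wherever the BIRD dev set was downloaded.
BASE_DATASET_DIR = "/data/bird/dev/dev.json"          # dev questions file
BASE_DABATASES_DIR = "/data/bird/dev/dev_databases/"  # directory containing the SQLite databases
```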

@cometyang

cometyang commented Mar 21, 2024

@MohammadrezaPourreza thanks for providing the evaluation code for connecting the two models. I am currently evaluating on Spider-Syn. Table 6 reports the DeepSeek 7B upper bound as 85.5 / 78.1, but I only get 79.8 and 72.5, so I am wondering whether I did something wrong during training. For DeepSeek 7B full fine-tuning, I got similar results, 69.1 and 56.1, which are very close to the 70.4 / 56.6 reported for full fine-tuning in Table 6. If I understand the paper correctly, if I fine-tune the DeepSeek model on filtered_finuting_dataset.csv and predict against the validation dataset, I should get the upper-bound results on the Spider-Syn dataset, am I right?

@MohammadrezaPourreza
Owner

Thank you very much, @cometyang, for your interest in our research! I'm curious to know if you have used neftune_noise_alpha, quantization, or perhaps employed LoRA adapters in your experiments? The findings presented in our paper are based on full fine-tuning without the use of quantization or LoRA adapters. Additionally, it's worth noting that in our analysis, neftune_noise_alpha seemed to detrimentally affect performance.
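
To make the distinction concrete, a full fine-tuning run without quantization, LoRA, or NEFTune could look roughly like the sketch below. This assumes the Hugging Face trl SFTTrainer; the checkpoint name, dataset path, text column, and hyperparameters are placeholders rather than the authors' exact configuration:

```python
# Minimal sketch of full fine-tuning: no 4/8-bit quantization, no LoRA, NEFTune off.
# Checkpoint, dataset path, column name, and hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # loaded in bf16, not quantized
)

# Placeholder CSV with one prompt+SQL training example per row in a "text" column.
train_dataset = load_dataset("csv", data_files="finetuning_dataset.csv")["train"]

args = TrainingArguments(
    output_dir="./dts-sql-full-ft",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",    # placeholder column name
    max_seq_length=2048,
    peft_config=None,             # full fine-tuning: all weights updated, no LoRA adapters
    neftune_noise_alpha=None,     # NEFTune disabled (reported above to hurt performance)
)
trainer.train()
```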

@cometyang

cometyang commented Mar 24, 2024

Hi @MohammadrezaPourreza, thanks for your reply. The reason the DTS-SQL work looks interesting is that it is currently the highest-ranked 7B model on the leaderboard at https://bird-bench.github.io/, so I want to dive into the work, understand the gap between the ideal situation and the trained model, and perhaps find ways to improve it.

To reproduce the results, I tried to follow the settings in your notebook exactly. Are you suggesting that the code used for the paper is different from the shared notebook? If so, could you please also share the full fine-tuning code? I can switch to fp16 and try other hyper-parameters, but I would appreciate it if you could share the settings needed to reproduce the work, so that I can reduce CO2 emissions and have less frustration. :-)

Thanks again for sharing the research work. I find it interesting that using two models can yield this performance improvement; it is like an agent framework.

I modified the code you shared for BIRD and adapted it to Spider-Syn. Compared to the numbers reported in the paper, the results I obtained are below. As you can see, there are noticeable differences, so I wonder where I made a mistake.

DeepSeek               Paper (Tab. 6)   My experiment   Diff
Full finetuning (EX)   70.4             69.1            -1.3
Full finetuning (EM)   56.6             56.1            -0.5
DTS-SQL (EX)           76.2             70.2            -6.0
DTS-SQL (EM)           68.9             62.0            -6.9
Upper bound (EX)       85.5             79.8            -5.7
Upper bound (EM)       78.1             72.5            -5.6

Evaluation command:
python evaluation.py --gold Gold.txt --pred Pred.txt --db $database_folder$ --etype all --table $dataset$/tables.json

@kanseaveg

@cometyang May I ask whether the results in Table 3 of the paper match the results you reproduced?
