DeepSearcher is well suited to answering complex queries. This evaluation introduction provides scripts for comparing the performance of DeepSearcher against naive RAG.
The evaluation is based on the Recall metric:
Recall@K: The percentage of relevant documents that are retrieved among the top K documents returned by the search engine.
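As a quick illustration, the sketch below computes Recall@K for a single query from a ranked list of retrieved document IDs and the set of relevant (gold) document IDs. The function name and data layout are illustrative only; they are not part of the evaluation script.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)

# Example: 2 of the 3 relevant documents appear in the top 5 results -> 0.666...
print(recall_at_k(["d1", "d7", "d3", "d9", "d4"], ["d3", "d4", "d5"], k=5))
```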
Currently, we support the multi-hop question answering dataset 2WikiMultiHopQA. More datasets will be added in the future.
The main evaluation script is `evaluate.py`.
You can provide a config file, say `eval_config.yaml`, to specify the LLM, embedding model, and other providers and parameters.
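The authoritative schema is the `eval_config.yaml` shipped with the repository, so treat the snippet below purely as a hypothetical sketch of the kind of settings such a file groups together; the keys, provider names, and models shown here are assumptions, not the official format.

```yaml
# Hypothetical sketch of an evaluation config; consult the repository's
# eval_config.yaml for the real keys and the supported providers.
provide_settings:
  llm:
    provider: "OpenAI"            # which LLM provider to use (example)
    config:
      model: "gpt-4o-mini"        # example model name
  embedding:
    provider: "OpenAIEmbedding"   # embedding model provider (example)
    config:
      model: "text-embedding-3-small"
  vector_db:
    provider: "Milvus"            # vector database backend (example)
    config:
      uri: "./milvus.db"
```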
```shell
python evaluate.py \
--dataset 2wikimultihopqa \
--config_yaml ./eval_config.yaml \
--pre_num 5 \
--output_dir ./eval_output
```
`pre_num` is the number of samples to evaluate. More samples give more accurate results, but they also take more time and consume more of your LLM API tokens.
After the dataset has been loaded into the vector database during the first run, you can set the `--skip_load` flag on the command line to skip loading it again.
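For example, a second run that reuses the already-loaded dataset might look like this (the other arguments are unchanged from the command above):

```shell
python evaluate.py \
--dataset 2wikimultihopqa \
--config_yaml ./eval_config.yaml \
--pre_num 5 \
--skip_load \
--output_dir ./eval_output
```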
For more details about the arguments, run:

```shell
python evaluate.py --help
```