BIRDIE: An effective NL-driven table discovery framework using a differentiable search index. BIRDIE first assigns each table a prefix-aware identifier and leverages an LLM-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates catastrophic forgetting.
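At its core, lookup is generative: a natural-language query goes into the encoder-decoder model, and a table identifier comes out. A minimal sketch of this idea (using a vanilla t5-base purely as a stand-in for the trained index model; this is not BIRDIE's actual code):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # stand-in for the trained BIRDIE model

query = "which city hosted the 2014 winter olympics?"
inputs = tokenizer(query, return_tensors="pt")

# Beam search over identifier tokens; the trained model returns a ranked list of
# table identifiers here (an untrained t5-base just produces arbitrary text).
outputs = model.generate(**inputs, max_length=16, num_beams=10, num_return_sequences=10)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```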
- Python 3.7
- PyTorch 1.10.1
- CUDA 11.5
- NVIDIA 4090 GPUs
Please refer to the source code to install all required Python packages.
We use three benchmark datasets: NQ-Tables, FetaQA, and OpenWikiTable.
Scenario I: Indexing from scratch
- Data preparation
  - Assign a table ID for each table in the repository.
    First, modify the dataset path in data_info.json and generate the representations of all tables:
    cd BIRDIE/tableid/
    python emb.py --dataset_name "fetaqa"
    Second, generate semantic IDs for each table through hierarchical clustering (a minimal sketch of the idea follows this step):
    python hierarchical_clustering.py --dataset_name "fetaqa" --semantic_id_dir "BIRDIE/tableid/docid/"
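hierarchical_clustering.py is the actual implementation; the sketch below only illustrates the general idea of prefix-aware semantic IDs via recursive clustering over the table embeddings produced by emb.py (the function, parameters, and ID format are illustrative assumptions, not the repository's).

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_ids(embs, k=10, leaf_size=10, prefix=""):
    """Recursively cluster table embeddings; tables in the same subtree share an ID prefix."""
    ids = {}
    if len(embs) <= leaf_size:                 # small enough: enumerate tables within the cluster
        for pos, tid in enumerate(embs):
            ids[tid] = prefix + str(pos)
        return ids
    keys = list(embs)
    X = np.stack([embs[t] for t in keys])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    for c in range(k):
        sub = {t: embs[t] for t, lab in zip(keys, labels) if lab == c}
        if len(sub) == len(keys):              # degenerate split: stop recursing
            for pos, tid in enumerate(sub):
                ids[tid] = f"{prefix}{c}-{pos}"
        elif sub:
            ids.update(assign_semantic_ids(sub, k, leaf_size, prefix + str(c) + "-"))
    return ids

# Usage: table_embs maps table name -> embedding vector produced by emb.py
# semantic_ids = assign_semantic_ids(table_embs)
```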
  - Generate synthetic queries for each table.
    Download the query generators and start the vLLM service, then run (a sketch of the underlying generation call follows this step):
    cd BIRDIE/query_generate/
    python query_g.py --dataset_name "fetaqa" --num 20 --tableid_path [Your path] --out_train_path [Your path]
    Here, tableid_path is the path to the tableid file, num is the number of synthetic queries generated per table, and out_train_path is the path to the output file.
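query_g.py handles query generation end to end; as a hedged illustration, a single request to a vLLM server's OpenAI-compatible completions endpoint might look as follows (the endpoint, served model name, and prompt are assumptions, not the repository's actual values).

```python
import requests

def generate_queries(table_text, num=20, endpoint="http://localhost:8000/v1/completions"):
    """Ask the served query generator for `num` candidate NL queries about one table."""
    prompt = ("Write a natural-language question that the following table can answer.\n"
              f"Table: {table_text}\nQuestion:")
    resp = requests.post(endpoint, json={
        "model": "query-generator",   # placeholder: name under which the generator is served
        "prompt": prompt,
        "n": num,                     # sample several diverse queries per table
        "max_tokens": 64,
        "temperature": 0.8,
    })
    return [choice["text"].strip() for choice in resp.json()["choices"]]
```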
- Train the model to index the tables in the repository (a sketch of the indexing objective follows the command).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run.py \
--task "Index" \
--train_file "./dataset/fetaqa/train.json" \
--valid_file "./dataset/fetaqa/test.json" \
--gradient_accumulation_steps 6 \
--max_steps 8000 \
--run_name "feta" \
--output_dir "./model/feta"
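run.py with --task "Index" performs the actual training; the sketch below only illustrates the indexing objective described above, i.e., teaching the model to emit a table's identifier for both the table text and its synthetic queries (model, example texts, identifier format, and hyperparameters are placeholders).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Both the flattened table text and its synthetic queries map to the same tabid.
examples = [
    ("table: 2014 Winter Olympics | Host city | Sochi ...", "3-1-7"),  # table text -> tabid
    ("which city hosted the 2014 winter olympics?", "3-1-7"),          # synthetic query -> tabid
]
for text, tabid in examples:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    labels = tokenizer(tabid, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # seq2seq cross-entropy on the identifier tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```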
- Search using the trained model
CUDA_VISIBLE_DEVICES=0 python3 run.py \
--task "Search" \
--train_file "./dataset/fetaqa/train.json" \
--valid_file "./dataset/fetaqa/test.json" \
--base_model_path "./model/feta/checkpoint-8000" \
--output_dir "./model/feta"
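Since every valid tabid is known in advance, decoding at search time can be restricted to identifiers that actually exist in the repository. A hedged sketch using Hugging Face's prefix_allowed_tokens_fn (the repository may realize this differently; the model and identifiers below are placeholders):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # stand-in for ./model/feta/checkpoint-8000

valid_tabids = ["3-1-7", "3-1-8", "4-0-2"]  # placeholder identifiers from the tabid file
# Token sequences of all valid identifiers, prefixed with the decoder start token.
id_seqs = [[model.config.decoder_start_token_id] + tokenizer.encode(t) for t in valid_tabids]

def allowed_tokens(batch_id, prefix_ids):
    """Allow only continuations that keep the decoded prefix on some valid identifier."""
    prefix = prefix_ids.tolist()
    nxt = {seq[len(prefix)] for seq in id_seqs
           if len(seq) > len(prefix) and seq[:len(prefix)] == prefix}
    return list(nxt) or [tokenizer.eos_token_id]

inputs = tokenizer("which city hosted the 2014 winter olympics?", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=3, num_return_sequences=3, max_length=16,
                         prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))  # ranked tabid candidates
```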
Scenario II: Index Update
- Data preparation for D0
  - Assign a tabid for each table in the repository D0.
    First, generate the representations of the tables:
    cd BIRDIE/tableid/
    python emb.py --dataset_name "fetaqa_inc_0"
    Second, generate semantic IDs for each table through hierarchical clustering:
    python hierarchical_clustering.py --dataset_name "fetaqa_inc_0" --semantic_id_dir "BIRDIE/tableid/docid/"
  - Generate synthetic queries for each table in D0, following the same steps as in "Indexing from scratch".
- Train the model M0 on the repository D0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py \
--task "Index" \
--train_file "./dataset/fetaqa_inc/train_0.json" \
--valid_file "./dataset/fetaqa_inc/test_0.json" \
--gradient_accumulation_steps 6 \
--max_steps 7000 --run_name "feta_inc0" \
--output_dir "./model/feta_inc0"
- Data preparation for D1
  - Assign a tabid for each table in the repository D1 by running the incremental tabid assignment algorithm.
    First, generate the representations of the tables:
    cd BIRDIE/tableid/
    python emb.py --dataset_name "fetaqa_inc_1"
    Second, generate semantic IDs for each table through the incremental tabid assignment algorithm (a sketch of the idea follows this step):
    python cluster_tree.py --dataset_name "fetaqa_inc_1" --base_tag "fetaqa_inc_0"
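cluster_tree.py implements the incremental tabid assignment; the sketch below only conveys the assumed intuition: a new table descends the cluster tree built over D0 by picking the nearest centroid at each level, so its identifier shares prefixes with similar existing tables (class and field names are illustrative).

```python
import numpy as np

class ClusterNode:
    """One node of the cluster tree built over the existing repository's embeddings."""
    def __init__(self, centroids=None, children=None, next_leaf=0):
        self.centroids = centroids      # (k, d) array of child centroids, or None at a leaf
        self.children = children or {}  # child index -> ClusterNode
        self.next_leaf = next_leaf      # counter for positions of newly inserted tables

def assign_incremental_tabid(root, emb):
    """Descend by nearest centroid, then append a fresh position inside the chosen leaf."""
    node, parts = root, []
    while node.centroids is not None:
        c = int(np.argmin(np.linalg.norm(node.centroids - emb, axis=1)))
        parts.append(str(c))
        node = node.children[c]
    parts.append(str(node.next_leaf))
    node.next_leaf += 1
    return "-".join(parts)
```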
  - Generate synthetic queries for each table in D1, following the same steps as in "Indexing from scratch".
- Train a memory unit L1 to index D1 based on the model M0 using LoRA (a sketch of the LoRA setup follows the command).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py \
--task "Index" \
--base_model_path "./model/feta_inc0/checkpoint-7000" \
--train_file "./dataset/fetaqa_inc/train_1.json" \
--valid_file "./dataset/fetaqa_inc/test_1.json" \
--peft True \
--gradient_accumulation_steps 6 \
--max_steps 4000 --run_name "feta_LoRA_d1" \
--output_dir "./model/feta_LoRA_d1"
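With --peft True, run_cont.py trains a small LoRA adapter on top of the frozen M0. A hedged sketch of that setup with the peft library, assuming a T5-style encoder-decoder backbone and illustrative hyperparameters:

```python
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

base = T5ForConditionalGeneration.from_pretrained("./model/feta_inc0/checkpoint-7000")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8, lora_alpha=32, lora_dropout=0.1,  # illustrative hyperparameters
    target_modules=["q", "v"],             # attention projections in T5-style blocks
)
model = get_peft_model(base, lora_cfg)     # base weights stay frozen; only LoRA params train
model.print_trainable_parameters()
# ... train on D1's query/tabid pairs as in Scenario I, then save only the small adapter:
# model.save_pretrained("./model/feta_LoRA_d1/checkpoint-4000")
```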
- Search tables using the model M0 and the plug-and-play LoRA memory L1
CUDA_VISIBLE_DEVICES=0 python3 run_cont.py \
--task "Search" \
--valid_file "./dataset/fetaqa_inc/test_0+1.json" \
--LoRA_1 "./model/feta_LoRA_d1/checkpoint-4000" \
--num 2 \
--partition_0 "./dataset/fetaqa_inc/train_0.json" \
--partition_1 "./dataset/fetaqa_inc/train_1.json" \
--output_dir "./model/feta_LoRA_d1"
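run_cont.py decides, per query, whether to answer with M0 alone or with the LoRA memory L1 attached, based on the two partitions passed above. The sketch below shows only the plug-and-play part with the peft library; the routing decision is left as an input because the actual routing logic lives in the repository (tokenizer and backbone are placeholders).

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from peft import PeftModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # placeholder; normally loaded from the checkpoint
base = T5ForConditionalGeneration.from_pretrained("./model/feta_inc0/checkpoint-7000")
model = PeftModel.from_pretrained(base, "./model/feta_LoRA_d1/checkpoint-4000")

def search(query, routed_to_d1, k=10):
    """Generate tabid candidates, activating the LoRA memory only for queries routed to D1."""
    inputs = tokenizer(query, return_tensors="pt")
    if routed_to_d1:
        out = model.generate(**inputs, num_beams=k, num_return_sequences=k, max_length=16)
    else:
        with model.disable_adapter():       # fall back to the original M0 parameters
            out = model.generate(**inputs, num_beams=k, num_return_sequences=k, max_length=16)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```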
We train query generators for tabular data based on a fine-tuned Llama3-8B model, as detailed in our paper. We release our trained query generators. Users can also train their own query generators for different table repositories.
The original datasets are from NQ-Tables, FetaQA, and OpenWikiTable.
We thank the authors of previous studies on table discovery/retrieval, Solo and DTR. Part of our code is adapted from DSI-QG.