BIRDIE: An effective NL-driven table discovery framework using a differentiable search index. BIRDIE first assigns each table a prefix-aware identifier and leverages an LLM-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates catastrophic forgetting.
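At its core, lookup is generative: a natural-language query goes into the encoder-decoder model, and a table identifier comes out. A minimal sketch of this idea (using a vanilla t5-base purely as a stand-in for the trained index model; this is not BIRDIE's actual code):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # stand-in for the trained BIRDIE model

query = "which city hosted the 2014 winter olympics?"
inputs = tokenizer(query, return_tensors="pt")

# Beam search over identifier tokens; the trained model returns a ranked list of
# table identifiers here (an untrained t5-base just produces arbitrary text).
outputs = model.generate(**inputs, max_length=16, num_beams=10, num_return_sequences=10)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```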
- Python 3.7
- PyTorch 1.10.1
- CUDA 11.5
- NVIDIA 4090 GPUs
Please refer to the source code to install all required Python packages.
We use three benchmark datasets: NQ-Tables, FetaQA, and OpenWikiTable.
Scenario I: Indexing from scratch
- Data preparation
  - Assign a table ID for each table in the repository.
    First, modify the dataset path in data_info.json and generate the representations of all tables:
    cd BIRDIE/tableid/
    python emb.py --dataset_name "fetaqa"
    Second, generate semantic IDs for each table through hierarchical clustering (a minimal sketch of the idea follows this step):
    python hierarchical_clustering.py --dataset_name "fetaqa" --semantic_id_dir "BIRDIE/tableid/docid/"
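hierarchical_clustering.py is the actual implementation; the sketch below only illustrates the general idea of prefix-aware semantic IDs via recursive clustering over the table embeddings produced by emb.py (the function, parameters, and ID format are illustrative assumptions, not the repository's).

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_ids(embs, k=10, leaf_size=10, prefix=""):
    """Recursively cluster table embeddings; tables in the same subtree share an ID prefix."""
    ids = {}
    if len(embs) <= leaf_size:                 # small enough: enumerate tables within the cluster
        for pos, tid in enumerate(embs):
            ids[tid] = prefix + str(pos)
        return ids
    keys = list(embs)
    X = np.stack([embs[t] for t in keys])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    for c in range(k):
        sub = {t: embs[t] for t, lab in zip(keys, labels) if lab == c}
        if len(sub) == len(keys):              # degenerate split: stop recursing
            for pos, tid in enumerate(sub):
                ids[tid] = f"{prefix}{c}-{pos}"
        elif sub:
            ids.update(assign_semantic_ids(sub, k, leaf_size, prefix + str(c) + "-"))
    return ids

# Usage: table_embs maps table name -> embedding vector produced by emb.py
# semantic_ids = assign_semantic_ids(table_embs)
```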
  - Generate synthetic queries for each table.
    Download the query generators and start the vLLM service, then run (a sketch of the underlying generation call follows this step):
    cd BIRDIE/query_generate/
    python query_g.py --dataset_name "fetaqa" --num 20 --tableid_path [Your path] --out_train_path [Your path]
    Here, tableid_path is the path to the tableid file, num is the number of synthetic queries generated per table, and out_train_path is the path to the output file.
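query_g.py handles query generation end to end; as a hedged illustration, a single request to a vLLM server's OpenAI-compatible completions endpoint might look as follows (the endpoint, served model name, and prompt are assumptions, not the repository's actual values).

```python
import requests

def generate_queries(table_text, num=20, endpoint="http://localhost:8000/v1/completions"):
    """Ask the served query generator for `num` candidate NL queries about one table."""
    prompt = ("Write a natural-language question that the following table can answer.\n"
              f"Table: {table_text}\nQuestion:")
    resp = requests.post(endpoint, json={
        "model": "query-generator",   # placeholder: name under which the generator is served
        "prompt": prompt,
        "n": num,                     # sample several diverse queries per table
        "max_tokens": 64,
        "temperature": 0.8,
    })
    return [choice["text"].strip() for choice in resp.json()["choices"]]
```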
- Train the model to index the tables in the repository (a sketch of the indexing objective follows the command).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run.py \
--task "Index" \
--train_file "./dataset/fetaqa/train.json" \
--valid_file "./dataset/fetaqa/test.json" \
--gradient_accumulation_steps 6 \
--max_steps 8000 \
--run_name "feta" \
--output_dir "./model/feta"
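run.py with --task "Index" performs the actual training; the sketch below only illustrates the indexing objective described above, i.e., teaching the model to emit a table's identifier for both the table text and its synthetic queries (model, example texts, identifier format, and hyperparameters are placeholders).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Both the flattened table text and its synthetic queries map to the same tabid.
examples = [
    ("table: 2014 Winter Olympics | Host city | Sochi ...", "3-1-7"),  # table text -> tabid
    ("which city hosted the 2014 winter olympics?", "3-1-7"),          # synthetic query -> tabid
]
for text, tabid in examples:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    labels = tokenizer(tabid, return_tensors="pt").input_ids
    loss = model(**batch, labels=labels).loss  # seq2seq cross-entropy on the identifier tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```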
- Search using the trained model
CUDA_VISIBLE_DEVICES=0 python3 run.py \
--task "Search" \
--train_file "./dataset/fetaqa/train.json" \
--valid_file "./dataset/fetaqa/test.json" \
--base_model_path "./model/feta/checkpoint-8000" \
--output_dir "./model/feta"
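Since every valid tabid is known in advance, decoding at search time can be restricted to identifiers that actually exist in the repository. A hedged sketch using Hugging Face's prefix_allowed_tokens_fn (the repository may realize this differently; the model and identifiers below are placeholders):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")  # stand-in for ./model/feta/checkpoint-8000

valid_tabids = ["3-1-7", "3-1-8", "4-0-2"]  # placeholder identifiers from the tabid file
# Token sequences of all valid identifiers, prefixed with the decoder start token.
id_seqs = [[model.config.decoder_start_token_id] + tokenizer.encode(t) for t in valid_tabids]

def allowed_tokens(batch_id, prefix_ids):
    """Allow only continuations that keep the decoded prefix on some valid identifier."""
    prefix = prefix_ids.tolist()
    nxt = {seq[len(prefix)] for seq in id_seqs
           if len(seq) > len(prefix) and seq[:len(prefix)] == prefix}
    return list(nxt) or [tokenizer.eos_token_id]

inputs = tokenizer("which city hosted the 2014 winter olympics?", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=3, num_return_sequences=3, max_length=16,
                         prefix_allowed_tokens_fn=allowed_tokens)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))  # ranked tabid candidates
```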
Scenario II: Index Update
- Data preparation for D0
  - Assign a tabid for each table in the repository D0.
    First, generate the representations of the tables:
    cd BIRDIE/tableid/
    python emb.py --dataset_name "fetaqa_inc_0"
    Second, generate semantic IDs for each table through hierarchical clustering:
    python hierarchical_clustering.py --dataset_name "fetaqa_inc_0" --semantic_id_dir "BIRDIE/tableid/docid/"
  - Generate synthetic queries for each table in D0, following the same steps as in "Indexing from scratch".
- Train the model M0 on the repository D0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py \
--task "Index" \
--train_file "./dataset/fetaqa_inc/train_0.json" \
--valid_file "./dataset/fetaqa_inc/test_0.json" \
--gradient_accumulation_steps 6 \
--max_steps 7000 --run_name "feta_inc0" \
--output_dir "./model/feta_inc0"
- Data preparation for D1
  - Assign a tabid for each table in the repository D1 by running the incremental tabid assignment algorithm.
    First, generate the representations of the tables:
    cd BIRDIE/tableid/
    python emb.py --dataset_name "fetaqa_inc_1"
    Second, generate semantic IDs for each table through the incremental tabid assignment algorithm (a sketch of the idea follows this step):
    python cluster_tree.py --dataset_name "fetaqa_inc_1" --base_tag "fetaqa_inc_0"
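cluster_tree.py implements the incremental tabid assignment; the sketch below only conveys the assumed intuition: a new table descends the cluster tree built over D0 by picking the nearest centroid at each level, so its identifier shares prefixes with similar existing tables (class and field names are illustrative).

```python
import numpy as np

class ClusterNode:
    """One node of the cluster tree built over the existing repository's embeddings."""
    def __init__(self, centroids=None, children=None, next_leaf=0):
        self.centroids = centroids      # (k, d) array of child centroids, or None at a leaf
        self.children = children or {}  # child index -> ClusterNode
        self.next_leaf = next_leaf      # counter for positions of newly inserted tables

def assign_incremental_tabid(root, emb):
    """Descend by nearest centroid, then append a fresh position inside the chosen leaf."""
    node, parts = root, []
    while node.centroids is not None:
        c = int(np.argmin(np.linalg.norm(node.centroids - emb, axis=1)))
        parts.append(str(c))
        node = node.children[c]
    parts.append(str(node.next_leaf))
    node.next_leaf += 1
    return "-".join(parts)
```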
  - Generate synthetic queries for each table in D1, following the same steps as in "Indexing from scratch".
- Train a memory unit L1 to index D1 based on the model M0 using LoRA (a sketch of the LoRA setup follows the command).
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py \
--task "Index" \
--base_model_path "./model/feta_inc0/checkpoint-7000" \
--train_file "./dataset/fetaqa_inc/train_1.json" \
--valid_file "./dataset/fetaqa_inc/test_1.json" \
--peft True \
--gradient_accumulation_steps 6 \
--max_steps 4000 --run_name "feta_LoRA_d1" \
--output_dir "./model/feta_LoRA_d1"
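With --peft True, run_cont.py trains a small LoRA adapter on top of the frozen M0. A hedged sketch of that setup with the peft library, assuming a T5-style encoder-decoder backbone and illustrative hyperparameters:

```python
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

base = T5ForConditionalGeneration.from_pretrained("./model/feta_inc0/checkpoint-7000")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8, lora_alpha=32, lora_dropout=0.1,  # illustrative hyperparameters
    target_modules=["q", "v"],             # attention projections in T5-style blocks
)
model = get_peft_model(base, lora_cfg)     # base weights stay frozen; only LoRA params train
model.print_trainable_parameters()
# ... train on D1's query/tabid pairs as in Scenario I, then save only the small adapter:
# model.save_pretrained("./model/feta_LoRA_d1/checkpoint-4000")
```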
- Search tables using the model M0 and the plug-and-play LoRA memory L1
CUDA_VISIBLE_DEVICES=0 python3 run_cont.py \
--task "Search" \
--valid_file "./dataset/fetaqa_inc/test_0+1.json" \
--LoRA_1 "./model/feta_LoRA_d1/checkpoint-4000" \
--num 2 \
--partition_0 "./dataset/fetaqa_inc/train_0.json" \
--partition_1 "./dataset/fetaqa_inc/train_1.json" \
--output_dir "./model/feta_LoRA_d1"
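run_cont.py decides, per query, whether to answer with M0 alone or with the LoRA memory L1 attached, based on the two partitions passed above. The sketch below shows only the plug-and-play part with the peft library; the routing decision is left as an input because the actual routing logic lives in the repository (tokenizer and backbone are placeholders).

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
from peft import PeftModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # placeholder; normally loaded from the checkpoint
base = T5ForConditionalGeneration.from_pretrained("./model/feta_inc0/checkpoint-7000")
model = PeftModel.from_pretrained(base, "./model/feta_LoRA_d1/checkpoint-4000")

def search(query, routed_to_d1, k=10):
    """Generate tabid candidates, activating the LoRA memory only for queries routed to D1."""
    inputs = tokenizer(query, return_tensors="pt")
    if routed_to_d1:
        out = model.generate(**inputs, num_beams=k, num_return_sequences=k, max_length=16)
    else:
        with model.disable_adapter():       # fall back to the original M0 parameters
            out = model.generate(**inputs, num_beams=k, num_return_sequences=k, max_length=16)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```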
We train query generators for tabular data based on a fine-tuned Llama3-8B model, as detailed in our paper. We release our trained query generators. Users can also train their own query generators for different table repositories.
The original datasets are from NQ-Tables, FetaQA, and OpenWikiTable.
We thank the authors of previous studies on table discovery/retrieval, Solo and DTR. Part of our code is adapted from DSI-QG.