
BIRDIE: Natural Language-Driven Table Discovery Using Differentiable Search Index

BIRDIE is an effective NL-driven table discovery framework built on a differentiable search index. BIRDIE first assigns each table a prefix-aware identifier and leverages an LLM-based query generator to create synthetic queries for each table. It then encodes the mapping between synthetic queries/tables and their corresponding table identifiers into the parameters of an encoder-decoder language model, enabling deep query-table interactions. During search, the trained model directly generates table identifiers for a given query. To accommodate the continual indexing of dynamic tables, we introduce an index update strategy via parameter isolation, which mitigates catastrophic forgetting.
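
For intuition, below is a minimal sketch of the search phase: the trained encoder-decoder decodes a table identifier token by token, constrained to valid identifiers by a trie. The checkpoint path matches Scenario I below, while the example identifiers and query are illustrative only; run.py contains the actual search logic.

# Minimal sketch of constrained table-identifier decoding (illustrative, not the run.py implementation).
from transformers import T5Tokenizer, T5ForConditionalGeneration

checkpoint = "./model/feta/checkpoint-8000"  # produced by Scenario I below
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Prefix-aware semantic identifiers produced by hierarchical clustering (illustrative values).
table_ids = ["3-1-7", "3-1-2", "5-0-4"]

# Build a token-level trie over all valid identifiers, rooted at the decoder start token.
trie = {}
for tid in table_ids:
    node = trie
    for tok in [model.config.decoder_start_token_id] + tokenizer(tid).input_ids:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, prefix):
    """Return the token ids that keep the generated prefix inside the identifier trie."""
    node = trie
    for tok in prefix.tolist():
        node = node.get(tok, {})
    return list(node.keys()) or [tokenizer.eos_token_id]

query = "which country hosted the 1998 world cup final?"
inputs = tokenizer(query, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=20,
    num_beams=3,
    num_return_sequences=3,
    prefix_allowed_tokens_fn=allowed_tokens,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))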

Requirements

  • Python 3.7
  • PyTorch 1.10.1
  • CUDA 11.5
  • NVIDIA RTX 4090 GPUs

Please refer to the source code for the full list of required Python packages.

Datasets

We use three benchmark datasets: NQ-Tables, FetaQA, and OpenWikiTable.

Run Experimental Cases

Scenario I: Indexing from scratch

  • Data preparation

    • Assign a table ID to each table in the repository

      First, modify the dataset path in data_info.json, then generate the representations of all tables:

      cd BIRDIE/tableid/
      python emb.py --dataset_name "fetaqa" 
      

      Second, generate semantic IDs for each table through hierarchical clustering (the idea behind this step is sketched after this scenario):

      python hierarchical_clustering.py --dataset_name "fetaqa" --semantic_id_dir "BIRDIE/tableid/docid/"
      
    • Generate synthetic queries for each table

      Download the query generators and start the vLLM service, then run the command below:

      cd BIRDIE/query_generate/
      python query_g.py --dataset_name "fetaqa" --num 20 --tableid_path [Your path] --out_train_path [Your path]
      

      Here, tableid_path is the path to the table ID file, num is the number of synthetic queries generated per table, and out_train_path is the path to the output file.

  • Train the model to index the tables in the repository

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run.py \
--task "Index" \
--train_file "./dataset/fetaqa/train.json" \
--valid_file "./dataset/fetaqa/test.json" \
--gradient_accumulation_steps 6 \
--max_steps 8000 \
--run_name "feta" \
--output_dir "./model/feta"
  • Search using the trained model
CUDA_VISIBLE_DEVICES=0 python3 run.py \
--task "Search" \
--train_file "./dataset/fetaqa/train.json" \
--valid_file "./dataset/fetaqa/test.json" \
--base_model_path "./model/feta/checkpoint-8000" \
--output_dir "./model/feta"
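
As referenced in the data-preparation step above, here is a minimal sketch of the idea behind prefix-aware semantic IDs: table embeddings (such as those produced by emb.py) are clustered recursively, and the path of cluster indices becomes the identifier. This is an illustration under assumed embedding inputs, not the implementation in hierarchical_clustering.py.

# Minimal sketch of prefix-aware semantic ID assignment via recursive clustering (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_ids(embeddings, k=10, leaf_size=10, prefix=()):
    """Recursively cluster embeddings; the path of cluster indices becomes the table ID."""
    keys = list(embeddings.keys())
    if len(keys) <= leaf_size:
        # Small group: append a position suffix so every table gets a unique identifier.
        return {key: "-".join(map(str, prefix + (pos,))) for pos, key in enumerate(keys)}
    matrix = np.stack([embeddings[key] for key in keys])
    labels = KMeans(n_clusters=min(k, len(keys)), n_init=10).fit_predict(matrix)
    if len(set(labels)) == 1:  # degenerate split; fall back to positional suffixes
        return {key: "-".join(map(str, prefix + (pos,))) for pos, key in enumerate(keys)}
    ids = {}
    for cluster in set(labels):
        members = {key: embeddings[key] for key, lab in zip(keys, labels) if lab == cluster}
        ids.update(assign_semantic_ids(members, k, leaf_size, prefix + (cluster,)))
    return ids

# Toy usage with random vectors standing in for real table embeddings.
rng = np.random.default_rng(0)
toy = {f"table_{i}": rng.normal(size=64) for i in range(100)}
print(list(assign_semantic_ids(toy).items())[:5])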

Scenario II: Index Update

  • Data preparation for D0

    • Assign a table ID to each table in the repository D0

      First, generate the representations of the tables:

      cd BIRDIE/tableid/
      python emb.py --dataset_name "fetaqa_inc_0" 
      

      Second, generate semantic IDs for each table through hierarchical clustering.

      python hierarchical_clustering.py --dataset_name "fetaqa_inc_0" --semantic_id_dir "BIRDIE/tableid/docid/"
      
    • Generate synthetic queries for each table in D0, similar to the steps in "Indexing from scratch"

  • Train the model M0 on the repository D0

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py \
--task "Index" \
--train_file "./dataset/fetaqa_inc/train_0.json" \
--valid_file "./dataset/fetaqa_inc/test_0.json" \
--gradient_accumulation_steps 6 \
--max_steps 7000 --run_name "feta_inc0" \
--output_dir "./model/feta_inc0" 
  • Data preparation for D1

    • Assign a table ID to each table in the repository D1 by running the incremental table ID assignment algorithm

      First, generate the representations of the tables:

      cd BIRDIE/tableid/
      python emb.py --dataset_name "fetaqa_inc_1" 
      

      Second, generate semantic IDs for each table through the incremental table ID assignment algorithm:

      python cluster_tree.py --dataset_name "fetaqa_inc_1" --base_tag "fetaqa_inc_0"
      
    • Generate synthetic queries for each table in D1, similar to the steps in "Indexing from scratch"

  • Train a memory unit L1 to index D1 based on the model M0 using LoRA

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python3 -m torch.distributed.launch --nproc_per_node=6 run_cont.py \
--task "Index" \
--base_model_path "./model/feta_inc0/checkpoint-7000" \
--train_file "./dataset/fetaqa_inc/train_1.json" \
--valid_file "./dataset/fetaqa_inc/test_1.json" \
--peft True \
--gradient_accumulation_steps 6 \
--max_steps 4000 --run_name "feta_LoRA_d1" \
--output_dir "./model/feta_LoRA_d1"
  • Search tables using the model M0 and the plug-and-play LoRA memory L1 (a sketch of this composition appears after this scenario)
CUDA_VISIBLE_DEVICES=0 python3 run_cont.py \
--task "Search" \
--valid_file "./dataset/fetaqa_inc/test_0+1.json" \
--LoRA_1 "./model/feta_LoRA_d1/checkpoint-4000" \
--num 2 \
--partition_0 "./dataset/fetaqa_inc/train_0.json" \
--partition_1 "./dataset/fetaqa_inc/train_1.json" \
--output_dir "./model/feta_LoRA_d1"
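
As referenced above, here is a minimal sketch of how the base model M0 and the plug-and-play LoRA memory L1 could be composed at search time with the peft library. The checkpoint paths follow the commands above; the query and the merging of candidates across partitions are illustrative, and run_cont.py implements the actual routing.

# Minimal sketch of composing the base model M0 with the LoRA memory L1 (illustrative only).
from transformers import T5Tokenizer, T5ForConditionalGeneration
from peft import PeftModel

base_path = "./model/feta_inc0/checkpoint-7000"     # M0, trained on D0
lora_path = "./model/feta_LoRA_d1/checkpoint-4000"  # L1, isolated parameters for D1

tokenizer = T5Tokenizer.from_pretrained(base_path)
m0 = T5ForConditionalGeneration.from_pretrained(base_path)
# Attaching the adapter leaves M0's weights untouched, which is what avoids forgetting D0.
m1 = PeftModel.from_pretrained(T5ForConditionalGeneration.from_pretrained(base_path), lora_path)

def generate_id(model, query):
    """Decode a table identifier for the query with the given model."""
    inputs = tokenizer(query, return_tensors="pt")
    out = model.generate(**inputs, max_length=20, num_beams=3)
    return tokenizer.decode(out[0], skip_special_tokens=True)

query = "teams that played in the 2021 final"
# Query both the frozen base model (covering D0) and the LoRA-augmented model (covering D1);
# run_cont.py merges and ranks the candidates across --partition_0 / --partition_1.
print(generate_id(m0, query), generate_id(m1, query))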

LLM-based Query Generators

We train query generators for tabular data based on a fine-tuned Llama3-8B model, as detailed in our paper. We release our trained query generators; users can also train their own query generators for different table repositories.
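
For reference, below is a minimal sketch of generating synthetic queries against a vLLM OpenAI-compatible server, as used in the data-preparation steps above. The endpoint, served model name, prompt, and table serialization are illustrative assumptions; query_g.py contains the actual logic.

# Minimal sketch of synthetic query generation via a vLLM OpenAI-compatible server (illustrative only).
from openai import OpenAI

# Assumes the query generator is served locally by vLLM at this endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

table_text = "Title: 1998 FIFA World Cup Final | Columns: Team, Goals, Coach"  # illustrative serialization
prompt = (
    "Given the following table, write one natural language question "
    f"that this table can answer.\n\n{table_text}\n\nQuestion:"
)
response = client.completions.create(
    model="birdie-query-generator",  # hypothetical served model name
    prompt=prompt,
    max_tokens=64,
    temperature=0.8,
    n=5,  # several diverse synthetic queries per table
)
for choice in response.choices:
    print(choice.text.strip())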

Acknowledgments

The original datasets are from NQ-Tables, FetaQA, and OpenWikiTable.

We thank the previous studies on table discovery/retrieval, Solo and DTR. We use part of the code of DSI-QG.
