Welcome to the repository for TQL (Table Query Language) – an innovative project dedicated to simplifying the interaction between natural language and structured database queries. In an era where data is the backbone of decision-making, TQL emerges as a powerful tool to bridge the gap between everyday language and precise database queries. This README is your guide to understanding the project's journey, its significance, and the structure of this repository.
- Project Overview
- Why TQL?
- Data Overview
- How TQL Works
- The Models Behind TQL
- The Road Ahead
- Repository Structure
TQL (Table Query Language) is a cutting-edge initiative that offers a seamless transition from natural language understanding to precise SQL queries. With TQL, you can input plain English text and watch it transform into intelligently crafted SQL queries, all while adhering to the underlying database schema. Our project aims to democratize the process of querying databases, making it accessible even to individuals with minimal SQL expertise.
We mainly used two data sources:
- Spider Dataset
- KaggleDBQA
We also explored WikiSQL, but since our area of research was multi-table schemas, we used only the first two data sources.
We split the data into two CSV files (a minimal loading sketch follows the list):
- A CSV containing the schemas.
- A CSV containing the train-test data: all the queries across the two datasets that are used for training and validation.
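As a minimal sketch (the file names `schemas.csv` and `train_test.csv` are hypothetical; substitute whichever CSVs the parsers in `databaseDesign` actually produce), the two files can be loaded with pandas like this:

```python
import pandas as pd

# Hypothetical file names; use the actual CSVs generated by the parsers.
schemas = pd.read_csv("schemas.csv")       # table/column definitions for each database
queries = pd.read_csv("train_test.csv")    # natural-language questions paired with their SQL

print(schemas.head())
print(queries.head())
```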
In a world driven by data, the ability to access and manipulate databases is invaluable. TQL offers a natural language interface, turning everyday language into powerful SQL commands. Whether you are a seasoned data professional, a researcher seeking insights, a developer streamlining database interactions, or an everyday user in search of specific information, TQL makes data retrieval accessible to all.
For any given query, TQL parses and accurately maps the schema information to understand the data-model context of the underlying schema. The tool can handle any schema, no matter how many tables, columns, or rows the dataset contains.
Our approach to understanding the data model means we never have to read the underlying data; we only need the table names and the column names in the schema. A minimal illustration of this schema-only view follows.
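For illustration, the schema-only view TQL works from can be as small as a mapping from table names to column names; the structure and names below are hypothetical, not the project's actual internal format:

```python
# Hypothetical schema-only representation: no data rows are ever read,
# only the table and column names from the DDL.
schema = {
    "students": ["student_id", "student_details"],
    "student_course_registrations": ["student_id", "course_id", "registration_date"],
    "student_course_attendance": ["student_id", "course_id", "date_of_attendance"],
}

# The mapping logic only needs this dictionary (plus the user's question)
# to decide which tables are relevant to a query.
```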
Sample test case:
The user inputs: What are the ids of the students who either registered or attended a course?
Output from ChatGPT
Although ChatGPT is able to generate a structurally correct SQL query, it is not able to use the correct table names and only guesses what they might be.
Output from Llama2 model
If we pass all the tables we have to a state-of-the-art Llama2 model, we still get an incorrect query that does not return the correct results.
Note: Passing all the tables in a schema would also be highly impractical, since sending details of hundreds of tables with every request is a very inefficient process.
Output from TQL
Since TQL is able to understand the context of the data model, it correctly maps the right set of tables, which is then passed to a Llama2 model for query generation.
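For illustration only, the snippet below shows the kind of table mapping and generated SQL one would expect for the sample question; the table and column names are assumed from a Spider-style schema and are not guaranteed to match the dataset exactly:

```python
# Hypothetical result of TQL's mapping step for the sample question.
mapped_tables = ["student_course_registrations", "student_course_attendance"]

# The kind of SQL a downstream Llama2 model can then generate once it is
# given only the relevant tables rather than the whole schema.
generated_sql = """
SELECT student_id FROM student_course_registrations
UNION
SELECT student_id FROM student_course_attendance;
"""
print(generated_sql)
```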
The journey through TQL involves the following steps:
TQL's journey begins with your simple request. Users provide queries in plain, everyday language. TQL is ready to interpret these queries and transform them into structured SQL commands.
We give the user the option to upload a custom Excel file with the schema in the format below; we are only concerned with the DDL and need no access to the data to perform our TQL magic.
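As one possible illustration of such a schema-only upload (the column layout below is an assumption, not the project's exact template), the file only needs DDL-level information:

```python
import pandas as pd

# Hypothetical layout: one row per column in the schema, no data rows required.
schema_upload = pd.DataFrame(
    {
        "table_name": ["students", "students", "student_course_attendance"],
        "column_name": ["student_id", "student_details", "student_id"],
    }
)
schema_upload.to_excel("my_schema.xlsx", index=False)  # the file the user would upload
```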
Additionally, users can indicate which schemas they have access to and manually select the schema they want to run queries against.
TQL is more than a language translator; it's a database expert. It understands the intricate structure and relationships within the database schema, ensuring data integrity and consistency.
With all the pieces in place, TQL generates a tailored SQL query that aligns perfectly with your natural language request. This process guarantees accurate and reliable results.
Sometimes, the road to answers can be a bit bumpy. If TQL detects that your input text doesn't relate to any database or schema, it will kindly ask for valid input text to ensure a successful query.
TQL relies on two separate workflows to perform its magic. These include:
- Mapping Logic: To understand the structure and relationships within the data.
- SQL Generation: To generate accurate SQL queries tailored to the input text (a simplified sketch of how the two stages fit together follows).
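A simplified sketch of how the two workflows might be chained is shown below; the function names `map_tables` and `generate_sql` are hypothetical placeholders, not the actual APIs in `queryProcessing/TableMapper.py` or `main/TQLRunner.py`:

```python
def map_tables(question: str, schema: dict) -> dict:
    """Mapping Logic (sketch): keep only the tables that look relevant to the question."""
    relevant = {
        table: columns
        for table, columns in schema.items()
        if any(token in question.lower() for token in table.lower().split("_"))
    }
    return relevant or schema  # fall back to the full schema if nothing matched


def generate_sql(question: str, relevant_schema: dict) -> str:
    """SQL Generation (sketch): build the prompt that would be sent to an LLM."""
    prompt = f"Schema: {relevant_schema}\nQuestion: {question}\nSQL:"
    # In the real pipeline this prompt goes to a model such as Llama2 or SQLCoder.
    return prompt


if __name__ == "__main__":
    schema = {
        "student_course_registrations": ["student_id", "course_id"],
        "student_course_attendance": ["student_id", "course_id"],
        "candidates": ["candidate_id"],
    }
    question = "What are the ids of the students who either registered or attended a course?"
    print(generate_sql(question, map_tables(question, schema)))
```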
We tested the TQL logic on quite a few models (a minimal generation sketch follows the list), including:
- T5
- T5 for code generation
- Llama2
- SQL Coder
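The sketch below shows one way such a model can be driven for SQL generation using the Hugging Face `transformers` seq2seq API; `t5-base` is used only as a stand-in checkpoint, and both the prompt format and the schema string are assumptions rather than the project's exact setup:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "t5-base" is a placeholder; in practice a checkpoint fine-tuned for
# text-to-SQL would be loaded here.
model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = (
    "translate to SQL: What are the ids of the students who either "
    "registered or attended a course? "
    "tables: student_course_registrations(student_id), "
    "student_course_attendance(student_id)"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```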
As we look to the future, TQL is far from reaching its final destination; there is a lot more to explore and improve.
Now, you may be wondering where all the magic happens. Here's a glimpse of the repository structure:
- `.ipynb_checkpoints`: Checkpoint files generated by Jupyter Notebooks.
- `Flask`: Files related to the Flask application.
  - `Templates`: HTML templates used in the Flask app.
  - `app.py`: The main Flask application file.
- `Model`: Files related to the project's machine learning model.
  - `baseModel.ipynb`: Jupyter Notebook containing the base model.
- `databaseDesign`: Database-related files.
  - `KaggleDBQA`: Files related to the Kaggle database.
    - `KaggleDBQA.csv`: Kaggle DBQA CSV data.
    - `KaggleDBQAParser.ipynb`: Jupyter Notebook for parsing KaggleDBQA.
    - `KaggleDBQA_Parser.py`: Python script for parsing KaggleDBQA.
    - `KaggleDBQA_schema.csv`: Schema of KaggleDBQA.
    - `KaggleDBQA_tables.json`: JSON file for tables.
    - `__init__.py`: Initialization file.
  - `MySQL`: MySQL database-related files.
    - `DatabaseUtils.py`: Python script for database utilities.
    - `__init__.py`: Initialization file.
    - `databaseUtilsRunner.ipynb`: Jupyter Notebook for running database utilities.
    - `sqlite_to_mysql.py`: Python script for converting SQLite to MySQL.
  - `Spider`: Spider-related files.
    - `.ipynb_checkpoints`: Checkpoint files.
      - `dataLoad-checkpoint.ipynb`: Jupyter Notebook for data loading.
    - `__pycache__`: Python cache files.
    - `TableParserSpider.py`: Spider parsing script.
    - `__init__.py`: Initialization file.
    - `dataLoad.ipynb`: Jupyter Notebook for data loading.
    - `tableParserSpiderRunner.ipynb`: Jupyter Notebook for running the Spider parser.
  - `WikiSQL`: Files related to WikiSQL data.
    - `WikiSQLDataLoad.ipynb`: Jupyter Notebook for loading WikiSQL data.
    - `wikiSQLprocesisng.ipynb`: Jupyter Notebook for processing WikiSQL data.
    - `__pycache__`: Python cache files.
- `main`: Main application files.
  - `TQLRunner.py`: Python script for TQL (Table Query Language) execution.
  - `tqlRunner.ipynb`: Jupyter Notebook for running TQL.
- `queryProcessing`: Query processing-related files.
  - `TableMapper.py`: Python script for mapping tables.
  - `tableMapperRunner.ipynb`: Jupyter Notebook for running the table mapper.
- `utils`: Utility functions and scripts.
  - `__init__.py`: Initialization file.
  - `utils.py`: Utility functions.