Our main contribution is a scalable fact-checking system that provides two main features:
- Question answering
- Fact checking
Our system combines multiple NLP components spanning NLI, QA, and IR (a conceptual sketch follows the list):
- Retriever: retrieves the set of data most relevant to the content the user requests
- Reader: searches for and extracts an answer to the user's question from the relevant data returned by the Retriever
- Inferrer: classifies each piece of data (evidence) in the set of most relevant data from the Retriever against the user's claim
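The sketch below shows how the two features compose these components; the function and method names are illustrative, not the project's actual API:

```python
# A minimal sketch of how the components compose; names are illustrative.

def answer_question(question, retriever, reader):
    # QA: retrieve relevant data, then extract an answer span from it
    docs = retriever.retrieve(question)
    return reader.extract_answer(question, docs)

def fact_check(claim, retriever, inferrer):
    # Fact checking: retrieve evidence, then label each piece against the claim
    evidence = retriever.retrieve(claim)
    return [(e, inferrer.classify(claim, e)) for e in evidence]
```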
Step-by-step instructions to reproduce the experiment on the test set of the MLQA dataset
Python: 3.7.5
pip install -r requirements.txt
Also, install the missing packages required by the FAISS library:
sudo apt-get install libopenblas-dev
sudo apt-get install libomp-dev
Clone:
git clone https://github.com/icesonata/docker-es-cococ-tokenizer.git
Deploy:
docker-compose up -d
*Note: this may require sudo privileges.
Create an index with title and content fields via the API (for cURL, skip to the alternative below the payload).
Send the payload below to localhost:9200/vi_mlqa_test with the PUT method.
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "vi_tokenizer",
            "char_filter": ["html_strip"],
            "filter": ["icu_folding"]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "content": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
*Alternatively, using cURL:
curl -XPUT "http://localhost:9200/vi_mlqa_test" -H 'Content-Type: application/json' -d'{"settings": {"index":{"number_of_shards":1,"number_of_replicas":1,"analysis":{"analyzer":{"my_analyzer":{"tokenizer":"vi_tokenizer","char_filter":["html_strip"],"filter":["icu_folding"]}}}}},"mappings":{"properties":{"title" :{"type":"text","analyzer":"my_analyzer"},"content":{"type":"text","analyzer":"my_analyzer"}}}}'
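The same index can also be created from Python; below is a minimal sketch assuming the elasticsearch client package (7.x-style body= argument), which is not part of this repo's scripts:

```python
# Create the vi_mlqa_test index with the same settings and mappings as above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
payload = {
    "settings": {"index": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        "analysis": {"analyzer": {"my_analyzer": {
            "tokenizer": "vi_tokenizer",
            "char_filter": ["html_strip"],
            "filter": ["icu_folding"],
        }}},
    }},
    "mappings": {"properties": {
        "title": {"type": "text", "analyzer": "my_analyzer"},
        "content": {"type": "text", "analyzer": "my_analyzer"},
    }},
}
es.indices.create(index="vi_mlqa_test", body=payload)
```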
Misc
Check the existing indices:
curl -XGET localhost:9200/_cat/indices
Count the number of documents in an index:
curl -XGET localhost:9200/vi_mlqa_test/_count
Set up MySQL with password=root and host port 15432:
docker run --name db_index -e MYSQL_ROOT_PASSWORD=root -p 15432:3306 -d mysql:latest
Get into the MySQL container:
docker exec -it db_index /bin/bash
*Note: this step may require sudo privileges.
Get into the MySQL server inside the container:
mysql -uroot -p
*Note: enter password=root when the server requires authentication.
Create a database named corpus:
CREATE DATABASE corpus;
USE corpus;
Create a new user:
CREATE USER 'longnguyen'@'%' IDENTIFIED BY 'longnguyen';
Grant the user sufficient privileges to access the corpus database from SQLAlchemy:
GRANT ALL PRIVILEGES ON corpus.* TO 'longnguyen'@'%';
Create document table:
CREATE TABLE mlqa_test_articles(id int not null auto_increment, title text, content longtext, publish_date varchar(50), primary key(id)) character set utf8mb4 collate utf8mb4_general_ci;
Create sentence table:
CREATE TABLE mlqa_test_sent_articles(id int not null auto_increment, sentence text, doc_id int, primary key(id), foreign key(doc_id) references mlqa_test_articles(id) on delete cascade) character set utf8mb4 collate utf8mb4_unicode_ci;
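For reference, rows can be written into these tables from Python via SQLAlchemy, matching the access granted above; in this sketch the pymysql driver and the sample values are assumptions, not part of the project:

```python
# Insert one document and one of its sentences, using the user, password,
# port, and schema created above.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://longnguyen:longnguyen@localhost:15432/corpus")

with engine.begin() as conn:
    result = conn.execute(
        text("INSERT INTO mlqa_test_articles (title, content, publish_date) "
             "VALUES (:title, :content, :date)"),
        {"title": "Sample title", "content": "Sample content.", "date": "2022-01-01"},
    )
    conn.execute(
        text("INSERT INTO mlqa_test_sent_articles (sentence, doc_id) "
             "VALUES (:sent, :doc_id)"),
        {"sent": "Sample content.", "doc_id": result.lastrowid},
    )
```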
Misc
Check tables:
SHOW tables;
DESCRIBE mlqa_test_articles;
DESCRIBE mlqa_test_sent_articles;
Count number of entries in the tables:
SELECT COUNT(*) FROM mlqa_test_articles;
SELECT COUNT(*) FROM mlqa_test_sent_articles;
Move to the dataset/ directory:
cd dataset/
Import the data into ElasticSearch:
python import_es.py
Import the data into MySQL:
python import_db.py
For deployment, make sure ElasticSearch and MySQL are working.
This step requires three separate shells:
- Backend
- Encoder
- Frontend
Move to the backend/ directory and deploy the server on 0.0.0.0 at port 8888 by running the command below:
python manage.py runserver 0.0.0.0:8888
Move to the encoder/ directory and change ROOT_DIR to the absolute path of the project directory, e.g.,
ROOT_DIR = "/home/username/FactCheck-QA/"
Then, run the command below:
python encoder_server.py
*Note: change the serving address of the Encoder via the serve() function in encoder/encoder_server.py.
NodeJS: >= 16.0
*Note: use nvm to switch to a newer Node.js version.
Move to the frontend/ directory and run the command below once to install the dependencies:
npm install
Then, every time the frontend needs to be deployed, just run the command below:
npm run dev
*Note: the frontend interacts with the backend via the API. You can change the backend address in frontend/src/@core/utils/api/api.js.
Look into dataset/[Research]_Sentence_processing_for_SquAD_format_dataset.ipynb, or reuse the available resources of the MLQA dataset we provide.
The API format and relevant documents for the backend can be found in backend/docs.
The system serves three services through its API; you can request them via the endpoints below, passing a field named data as form data:
- localhost:8888/api/search/relevance/: information retrieval; retrieves relevant data given a piece of information
- localhost:8888/api/search/answering/: question answering; answers a question given by the user
- localhost:8888/api/search/inference/: fact checking; returns a list of evidence supporting or refuting the claim given by the user
*Note: the URL strictly requires the trailing slash / at the end. Also, replace the address if the backend runs on a different port or address.
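For example, with the requests package (the input string is a placeholder):

```python
# Call one of the endpoints above with the `data` form field; assumes the
# backend is running on localhost:8888.
import requests

resp = requests.post(
    "http://localhost:8888/api/search/answering/",  # trailing slash required
    data={"data": "your question or claim here"},
)
print(resp.json())
```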
There are some notes for this project:
- There is an indexing mismatch between the indices mentioned in the comments, for example:
  - MySQL: 1-indexed
  - ElasticSearch: 0-indexed
  - FAISS: 0-indexed
- To change the language models downloaded from HuggingFace, there are two files to be concerned with:
  - backend/apps/search/components/config.py:
    - READER_MODEL: the question-answering model
    - INFERRER_MODEL: the natural language inference model, either a text-classification model following the NLI scheme or a zero-shot classification model. Note that different models have different output styles, so backend/apps/search/components/inferrer.py must be configured to comply with the output format.
  - encoder/encoder_server.py:
    - EMBEDDING_MODEL: the embedding model released by SBERT
    - FEATURE_SIZE: the dimension of the output that the embedding model produces
- We use IndexFlatL2 combined with IndexIDMap from FAISS for semantic search (see the first sketch after this list).
- Remember to alter the addresses and ports of the different services in the config.py file. Also, K indicates the number of documents retrieved by the full-text search conducted in ElasticSearch, while L indicates the number of sentences to retrieve from those K documents by semantic search. In other words, the information retrieval step narrows K documents down to L sentences.
- The Reader offers two modes, concat and ensemble, which can be set in config.py (see the second sketch after this list):
  - concat: concatenates the L retrieved sentences into one context and, together with the question, feeds it to the language model
  - ensemble: treats each of the L retrieved sentences as a separate context, then keeps the answer with the highest confidence score provided by the language model. Note: this mode usually produces less accurate results.
- By default, the project runs only on the CPU. If a GPU is available, consider switching device to 0, gpu, etc., for better performance.
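First sketch, the FAISS index layout mentioned above: IndexFlatL2 wrapped in IndexIDMap so each vector keeps an external (e.g. database) id. FEATURE_SIZE and the ids below are illustrative values:

```python
import faiss
import numpy as np

FEATURE_SIZE = 768  # must match the embedding model's output dimension

# IndexFlatL2 does exact L2 search; IndexIDMap attaches our own ids.
index = faiss.IndexIDMap(faiss.IndexFlatL2(FEATURE_SIZE))
vectors = np.random.rand(3, FEATURE_SIZE).astype("float32")  # stand-in embeddings
ids = np.array([1, 2, 3], dtype="int64")                     # e.g. sentence ids
index.add_with_ids(vectors, ids)

query = np.random.rand(1, FEATURE_SIZE).astype("float32")
distances, found_ids = index.search(query, 2)  # ids of the 2 nearest sentences
```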
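Second sketch, the two Reader modes over the L retrieved sentences, using a HuggingFace question-answering pipeline; the model name and the helper below are illustrative, not the project's actual code:

```python
from transformers import pipeline

# A multilingual extractive-QA model stands in for READER_MODEL here.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

def read(question, sentences, mode="concat"):
    if mode == "concat":
        # join the L sentences into a single context
        return qa(question=question, context=" ".join(sentences))
    # ensemble: run the model once per sentence, keep the most confident answer
    answers = [qa(question=question, context=s) for s in sentences]
    return max(answers, key=lambda a: a["score"])
```

Passing device=0 to pipeline() would run the model on the first GPU, in line with the device note above.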
Authors: