This repository contains the code and documentation for the deception detection project conducted as part of the BITAmin🍊 conference. The goal of this project is to detect signs of deception in conversations between an investigator and a suspect. To achieve this, we first used GPT-3.5 to generate investigator-suspect dialogue scripts containing lying signals in the suspect's utterances. We then fine-tuned the LLaMA-2 model on this data. Additionally, we constructed a RAG pipeline that accepts incident records as external knowledge to identify inconsistencies between the conversation content and the incident records. Please refer to the `presentation.pdf` file in the `others` folder for detailed information about the project. Below are the frameworks used for the project.
This script generates synthetic conversation data, specifically dialogues between an investigator and a suspect. Note that I referred to the official LangChain documentation for this stage. Below is a brief explanation of the different sections and functionalities within the script:
- Loading Dataset and Preparing Prompt: The script loads a dataset from a CSV file named `contradicts.csv`. This file lists pairs of contradictory sentences and was used to insert contradictory statements into the suspect's speech. (We obtained the data from here.) Next, the code defines the classes and configurations needed for generating synthetic data. It sets up a template for generating a conversation between an investigator and a suspect. This includes defining the conversation structure, the types of lying signals (`IH_A`, `IH_B`, `VE`, `LM`, `TP`), and example prompts illustrating how each type of lying signal is used.
- Synthetic Data Generation: This section uses an OpenAI model (GPT-3.5-turbo) through the LangChain library to generate synthetic conversation data based on the constructed templates.
- Preprocessing: Removes the lying signal tags from the conversation scripts to build the training dataset (a minimal sketch of the generation and preprocessing steps follows below).
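Below is a minimal sketch of how the generation and preprocessing steps could look. The column names (`sentence1`, `sentence2`), the prompt wording, and the `[TAG]` format are illustrative assumptions, not the exact code used in the project; the LangChain imports assume a recent release.

```python
import re

import pandas as pd
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI  # requires the langchain-openai package

# Load the contradictory sentence pairs (column names are assumptions).
contradicts = pd.read_csv("contradicts.csv")

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.9)

# Template asking GPT-3.5 for an investigator-suspect dialogue with tagged lying signals.
prompt = ChatPromptTemplate.from_template(
    "Write an interrogation dialogue between an investigator and a suspect.\n"
    "Tag each lying signal in the suspect's lines as [IH_A], [IH_B], [VE], [LM], or [TP].\n"
    "Work these two contradictory statements into the suspect's speech:\n"
    "1. {sentence_a}\n"
    "2. {sentence_b}\n"
)

chain = prompt | llm
row = contradicts.iloc[0]
script = chain.invoke({"sentence_a": row["sentence1"], "sentence_b": row["sentence2"]}).content

# Preprocessing: strip the lying-signal tags before adding the script to the train set.
train_text = re.sub(r"\[(IH_A|IH_B|VE|LM|TP)\]\s*", "", script)
print(train_text)
```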
This script uses the `autotrain-advanced` package to fine-tune the LLaMA-2 model. Note that I referred to the official Hugging Face documentation for this stage. Below is a brief explanation of the different sections of the code.
- Setting Hyperparameters: Various hyperparameters such as the model name, learning rate, number of epochs, batch size, etc., are configured. These settings play a crucial role in model performance.
- Fine-tuning: Using autotrain, the script fine-tunes the LLaMA-2 model with the provided hyperparameters (a hedged sketch follows this list). In this step, the `autotrain` command is used to:
  - Set up the model and data paths
  - Pass hyperparameters such as the learning rate, batch size, and number of epochs
  - Activate features such as pushing to the Hugging Face Hub, quantization, and mixed precision, depending on the options
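The call below is a hedged sketch of how the `autotrain` CLI might be driven from Python. The project name, data path, and hyperparameter values are placeholders, and flag names vary across `autotrain-advanced` versions, so check `autotrain llm --help` for the exact options in your installation.

```python
import subprocess

# Illustrative autotrain-advanced invocation; flag names and values are assumptions
# and differ between autotrain-advanced releases.
cmd = [
    "autotrain", "llm", "--train",
    "--project-name", "llama2-deception",        # assumed project name
    "--model", "meta-llama/Llama-2-7b-chat-hf",  # base model to fine-tune
    "--data-path", "data/",                      # folder containing the train CSV
    "--lr", "2e-4",
    "--batch-size", "4",
    "--epochs", "3",
    "--use-peft",                                # parameter-efficient fine-tuning (LoRA)
    "--quantization", "int4",                    # 4-bit quantization
    "--mixed-precision", "fp16",
    "--push-to-hub",                             # optional: push the result to the Hub
]
subprocess.run(cmd, check=True)
```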
This script uses the LangChain library and the fine-tuned LLaMA-2 model to construct the RAG pipeline. Below is a brief explanation of the different sections of the code.
- Setting Up Retrieval: Loads a PDF document (`investigation-report.pdf`) and splits it into pages. Uses a `RecursiveCharacterTextSplitter` to split the pages into smaller text chunks. Embeds the chunks using `OpenAIEmbeddings` and stores them in a `Chroma` vector store.
- Model Inference with RAG: Sets up a retriever to search through the stored text chunks. Given an instruction and an input conversation script, the script queries the model to identify lying signals in the suspect's utterances (see the sketch after this list).
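Below is a minimal sketch of the retrieval setup and a RAG-style query with LangChain. The chunking parameters, the instruction text, the example conversation, and the `query_finetuned_llama` helper are assumptions; the exact import paths depend on your LangChain version.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1) Load the incident record and split it into pages, then into smaller chunks.
pages = PyPDFLoader("investigation-report.pdf").load_and_split()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(pages)

# 2) Embed the chunks and store them in a Chroma vector store, then build a retriever.
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 3) Retrieve record passages relevant to the conversation and prepend them to the prompt
#    that is sent to the fine-tuned LLaMA-2 model.
conversation = "Investigator: Where were you at 9 pm?\nSuspect: I was at home all night."
context = "\n".join(doc.page_content for doc in retriever.invoke(conversation))
prompt = (
    "Using the incident record below, identify lying signals or statements that "
    "contradict the record in the suspect's utterances.\n\n"
    f"Incident record:\n{context}\n\nConversation:\n{conversation}"
)
# `query_finetuned_llama` is a hypothetical wrapper around the fine-tuned LLaMA-2 model.
# response = query_finetuned_llama(prompt)
```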
🥇 Grand Prize