KekStroke/text-detoxification

Author: Anvar Iskhakov
Email: [email protected]
Group: BS21-AI

Prerequisites

Dependencies

Install all required packages with:

pip install -r requirements.txt

Model weights

Download the models folder and place it in the /models folder in the project root.

Raw data

Download the raw dataset folder and place it in the /data/raw folder in the project root.

Prepare Data

To preprocess the dataset for training, run the following command from the repository root:

python src/data/make_dataset.py 

Optional arguments can be passed as well. The command with the default arguments is:

python src/data/make_dataset.py --input_file data/raw/filtered.tsv --output_file data/internal/preprocessed.csv --tokenizer_model SkolkovoInstitute/roberta_toxicity_classifier 

Train model

To train the final model on the preprocessed dataset, run the following command from the repository root:

python src/models/train_model.py 

Optional arguments can be passed as well. The command with the default arguments is:

python src/models/train_model.py --max_length 85 --batch_size 64 --checkpoint_name test --model_name SkolkovoInstitute/bart-base-detox --train_ratio 0.8 --val_test_ratio 0.5 --learning_rate 0.00002 --weight_decay 0.01 --save_total_limit 1 --num_train_epochs 1 --save_steps 500 --eval_steps 500 --logging_steps 100
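
The split flags can be read as a two-stage split: train_ratio carves off the training set, and val_test_ratio divides the remainder between validation and test. The sketch below only illustrates that arithmetic under this assumption. The function name split_sizes is hypothetical, and the actual logic in train_model.py may differ (for example, in rounding or in which part counts as validation).

```python
def split_sizes(n_rows, train_ratio=0.8, val_test_ratio=0.5):
    """Return (train, val, test) row counts for an assumed two-stage split.

    Assumption: train_ratio is the share of all rows used for training, and
    val_test_ratio is the share of the remaining rows used for validation.
    """
    n_train = int(n_rows * train_ratio)
    n_rest = n_rows - n_train
    n_val = int(n_rest * val_test_ratio)
    n_test = n_rest - n_val  # whatever is left goes to the test set
    return n_train, n_val, n_test


if __name__ == "__main__":
    # With the defaults, 1000 rows split into 800 / 100 / 100.
    print(split_sizes(1000))
```

Under this reading, the defaults give an 80/10/10 split of the preprocessed dataset.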

Inference

To use the final trained model on your own sentences, run the following command from the repository root:

python src/models/predict_model.py

Optional arguments can be passed as well. The command with the default arguments is:

python src/models/predict_model.py --checkpoint_name best --max_length 85 --model_name SkolkovoInstitute/bart-base-detox
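
When only one or two arguments need to change, the command can be assembled from the defaults listed above. This is a minimal sketch and not part of the repository: build_command and DEFAULTS are hypothetical names, and only the three flags shown in this README are assumed to exist.

```python
import sys

# Defaults copied from the inference command in this README.
DEFAULTS = {
    "checkpoint_name": "best",
    "max_length": 85,
    "model_name": "SkolkovoInstitute/bart-base-detox",
}


def build_command(script="src/models/predict_model.py", **overrides):
    """Build the argument list for the inference script, applying overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Reject flags that this README does not document.
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    args = {**DEFAULTS, **overrides}
    cmd = [sys.executable, script]
    for name, value in args.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    # Print the command with a longer maximum length and the other defaults.
    print(" ".join(build_command(max_length=128)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the script.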

Miscellaneous

Other hypotheses were tested in the notebooks in /notebooks/extra_hypotheses.
