CodeHateChallenge

Plugin

How to run the plugin:

Install Node.js and npm.
Run in this directory:
```
npm install
npm start
```

Backend

To run the backend, use:

docker-compose up

Three API endpoints are available:

/ping - check whether the service is alive
/ishate - check whether the input text is hateful
/whyhate - obtain an explanation for hateful paragraphs

To send API requests from the command line, use:

curl -X POST -H "Content-Type: text/plain" --data "sample text" -v localhost:8080/ishate
curl -X POST -H "Content-Type: text/plain" --data "sample text" -v localhost:8080/whyhate
curl -X GET localhost:8080/ping
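The same three requests can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_request` and `call` are hypothetical, and the response format is not documented here, so the body is returned as raw text.

```python
import urllib.request

# Assumption: the backend was started locally with docker-compose and listens on port 8080.
BASE_URL = "http://localhost:8080"


def build_request(endpoint, text=None):
    """Build a GET request (no text, e.g. /ping) or a text/plain POST request."""
    url = f"{BASE_URL}/{endpoint.lstrip('/')}"
    if text is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def call(endpoint, text=None, timeout=10):
    """Send the request and return the raw response body as a string."""
    request = build_request(endpoint, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the backend running, `call("ping")` checks availability, while `call("ishate", "sample text")` and `call("whyhate", "sample text")` mirror the curl commands above.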

Data augmentation

Fast pipeline to augment data in any language

The pipeline can be adapted easily to any language. An example pipeline for Polish:

Scraping data (positive and negative examples)

Example utterances (positive and negative) are scraped from https://pl.wiktionary.org/wiki/ based on a list of offensive words. Because a listed word can appear in both offensive and non-offensive senses, this process increases the number of potential false-positive examples (harmless uses of a listed word) in the dataset.

Selenium is used to scrape the data.

How to use:

python src/augment/scrap_wiki.py 
d = OffensiveDict("wiki_words2.json")
d.create_csv_with_utterances()

Example scraped sentences for the word pies:

Czy to jest pies czy suka? " zool. Canis familiaris[1], zwierzę domowe; zob. też pies w Wikipedii" - not offensive example
Psy stoją na patrolu.,pies," slang. obraź. policjant[3], żandarm lub milicjant" - offensive example
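One way to turn such rows into labeled examples is sketched below. It rests on two assumptions that are not confirmed by the repository: that each CSV row has the layout sentence, word, definition (as in the second sample row), and that the Wiktionary marker "obraź." in the definition is what distinguishes an offensive sense. The function name `label_rows` is hypothetical.

```python
import csv
import io

# Assumption: "obraź." is the Wiktionary abbreviation that flags an offensive sense.
OFFENSIVE_MARKER = "obraź."


def label_rows(csv_text):
    """Parse 'sentence,word,definition' rows and attach an offensive/not-offensive label."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        sentence, word, definition = row
        rows.append({
            "sentence": sentence,
            "word": word,
            "offensive": OFFENSIVE_MARKER in definition,
        })
    return rows


# Sample rows written in the assumed three-column layout.
sample = (
    'Czy to jest pies czy suka?,pies," zool. Canis familiaris[1], zwierzę domowe"\n'
    'Psy stoją na patrolu.,pies," slang. obraź. policjant[3], żandarm lub milicjant"\n'
)

for item in label_rows(sample):
    print(item["offensive"], item["sentence"])
```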

EDA

EDA (Easy Data Augmentation) is a package used to augment English text data.
Git source: https://github.com/jasonwei20/eda_nlp.git

Before first use:

Install nltk:

pip install -U nltk

Download WordNet:

python
>>> import nltk; nltk.download('wordnet')

How to augment an English sentence:

augmenter = Augmenter()
res = augmenter.augment_text("All black people should be slaves")

Example output:

['all black people totally should be slaves', 'all black be slaves', 'all black people totally should be slaves', 'all black people should be slaves', 'all black slaves should be people', 'all black people be slaves', 'atomic number all black people should be slaves', 'all shirley temple people should be slaves', 'all black the great unwashed should be slaves', 'all black people should be slaves']
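The output above mixes several EDA operations: synonym replacement and random insertion (both WordNet-based), random swap, and random deletion. As an illustration, the two operations that need no WordNet can be sketched as follows. This is a simplified stand-in written for this README, not the code of the EDA package or of the `Augmenter` class, and the function names are hypothetical.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of words with n_swaps random pairs of positions exchanged."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # seeded only to make the demo repeatable
tokens = "the quick brown fox jumps".split()
print(" ".join(random_swap(tokens, 1, rng)))
print(" ".join(random_deletion(tokens, 0.2, rng)))
```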

Custom back translation

Required packages:

pip install transformers
pip install neptune-client
pip install sentencepiece
pip install fairseq
pip install subword-nmt

How to use:

augmenter = Augmenter("polish_offensive_dict.json")
augmenter.back_translation("Kurwa, uchodźcy niszczą Polskę, jebane kozojebcy", first_lang="polish", second_lang="english")

Example result:

>>> Kurwa, uchodźcy niszczą Polskę, wy pierdolone kozie skurwiele.
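The round trip behind back translation can be sketched independently of any particular model. In the sketch below the translators are passed in as plain callables; the toy word-table translators are hypothetical stand-ins for real translation models and exist only to make the example runnable. How `Augmenter.back_translation` is implemented internally is not shown in this README.

```python
def back_translate(text, forward, backward):
    """Translate text to a pivot language and back, yielding a paraphrase candidate."""
    pivot = forward(text)
    return backward(pivot)


def make_word_translator(table):
    """Toy translator: replaces known words one by one (stand-in for a real model)."""
    def translate(text):
        return " ".join(table.get(word, word) for word in text.split())
    return translate


# Hypothetical toy dictionaries; a real setup would call translation models instead.
pl_to_en = make_word_translator({"dobry": "good", "pies": "dog"})
en_to_pl = make_word_translator({"good": "fajny", "dog": "pies"})

print(back_translate("dobry pies", pl_to_en, en_to_pl))  # prints "fajny pies"
```

The returned sentence keeps the meaning but may differ in wording, which is what makes it useful as an augmented training example.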

Models

BERT UDA

[Figure: BERT UDA architecture]
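For orientation, UDA (Unsupervised Data Augmentation) in general combines a supervised loss on labeled examples with a consistency loss that pushes the model to give similar predictions for an unlabeled example and its augmented version (for instance, a back-translated one). The sketch below shows that generic objective on raw logits in plain Python. It is not this repository's training code, and the function names and the weight `lam` are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def uda_loss(labeled_logits, label, unlabeled_logits, augmented_logits, lam=1.0):
    """Cross-entropy on a labeled example plus lam times the consistency term."""
    supervised = -math.log(softmax(labeled_logits)[label])
    consistency = kl_divergence(softmax(unlabeled_logits), softmax(augmented_logits))
    return supervised + lam * consistency
```

When the predictions for the original and the augmented text agree, the consistency term is zero and only the supervised loss remains.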

Other models

detoxify (will be used in the demo)
transformers from Hugging Face (several experiments aimed at finding a good approach for less popular languages):

one model per language
multilingual BERT
a model trained on Polish data with added translated, labeled English examples
an English pretrained model trained on translated Polish tweets with added English examples
a multilingual model trained on sentences containing the original Polish tweet plus its translation
a small demo of BERT UDA trained on Polish tweets (pretrained on Polish), with back translation (marking translation changes on 'strong' words) and with EDA

(in no particular order)

Results on a small dataset of 800 examples extracted by the method described above (about 600 used for training)

AWarno/CodeHateChallenge
