A collaborative project by Bendik Solevåg and Erik Hystad
This project requires Python 3.8 and pip 21.3.1.
To install the necessary dependencies, run setup.py.
To run the project, run `python3 main.py` in the project root.
After a benchmark has been run, its evaluation results can be found in the ./results/ directory.
This repository aims to benchmark state-of-the-art Norwegian language models on various tasks. The table below lists the benchmarks, their datasets, and the files responsible for running each test.
| Benchmark | Executable | Dataset |
|---|---|---|
| Sentence-level sentiment polarity | ./sentence_level_sentiment_polarity.py | ./Data/sentence_level_sentiment_polarity/train.json ./Data/sentence_level_sentiment_polarity/test.json |
| Dialect classification | ./DialectClassification.py | ./Data/dialect_classification/dialect_tweet_train.json ./Data/dialect_classification/dialect_tweet_test.json |
| Dependency parsing | ./TokenClassification.py | ./Data/pos_tagging/no_bokmaal-ud-train.conllu ./Data/pos_tagging/no_bokmaal-ud-test.conllu ./Data/pos_tagging/no_nynorsk-ud-train.conllu ./Data/pos_tagging/no_nynorsk-ud-test.conllu |
| Part-of-speech tagging | ./TokenClassification.py | ./Data/pos_tagging/no_bokmaal-ud-train.conllu ./Data/pos_tagging/no_bokmaal-ud-test.conllu ./Data/pos_tagging/no_nynorsk-ud-train.conllu ./Data/pos_tagging/no_nynorsk-ud-test.conllu |
| Named entity recognition | ./TokenClassification.py | ./Data/pos_tagging/no_bokmaal-ud-train.conllu ./Data/pos_tagging/no_bokmaal-ud-test.conllu ./Data/pos_tagging/no_nynorsk-ud-train.conllu ./Data/pos_tagging/no_nynorsk-ud-test.conllu |
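
The `.conllu` files listed above use the CoNLL-U format from Universal Dependencies: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. As a rough illustration of what the token-level data looks like, here is a minimal sketch that extracts (word form, UPOS tag) pairs. It is not the loading code used in `TokenClassification.py`; the function name `read_conllu` and the sample sentence are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (word form, UPOS tag) pairs


def read_conllu(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines.

    Hypothetical helper for illustration; not part of this repository.
    """
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        # Column 2 is FORM, column 4 is UPOS (0-based indices 1 and 3).
        sentence.append((cols[1], cols[3]))
    if sentence:
        yield sentence  # file did not end with a blank line


# Made-up sample in CoNLL-U layout (not taken from the datasets).
sample = """# text = Hun leser.
1\tHun\thun\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tleser\tlese\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t$.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

sentences = list(read_conllu(sample.splitlines()))
print(sentences)
```

To try it on one of the real files, pass an open file handle instead of the sample, for example `open("./Data/pos_tagging/no_bokmaal-ud-train.conllu", encoding="utf-8")`. The other columns (lemma, head, dependency relation) can be read the same way by picking a different index.
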
The models we are benchmarking are each described in their own paper.
We found that Hugging Face has a well-developed knowledge base, and this article on fine-tuning a pretrained model was particularly helpful. We also relied heavily on this article on training for named entity recognition.