Note that this issue occurs only in #25, not the master branch. The splitting model implemented in 3e48684 performs badly, e.g.:
(virtualenv) $ python -m deep_reference_parser split "Upson MA (2019). This is a reference. In a journal. 16(1) 1-23" -t
Using TensorFlow backend.
ℹ Using config file:
/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/configs/2020.3.6_splitting.ini
ℹ Attempting to download model artefacts if they are not found locally
in models/splitting/2020.3.6_splitting/. This may take some time...
✔ Found models/splitting/2020.3.6_splitting/indices.pickle
✔ Found models/splitting/2020.3.6_splitting/weights.h5
✔ Found embeddings/2020.1.1-wellcome-embeddings-300.txt
=============================== Token Results ===============================
token label
--------- -----
Upson null
MA i-r
( i-r
2019 i-r
) i-r
. o
This o
is o
a o
reference o
. o
In i-r
a i-r
journal i-r
. o
16(1 o
) o
1 o
- o
23 o
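One way to make the failure concrete is to group the contiguous tokens that the model marks as inside a reference and see what spans come out. The following is only a rough sketch and is not part of deep_reference_parser: `reference_spans` is a hypothetical helper, the token/label pairs are copied by hand from the output above, and treating `b-r` and `i-r` as the "inside a reference" labels is an assumption (only `i-r`, `o` and `null` appear in this output).

```python
from itertools import groupby

# (token, label) pairs copied by hand from the Token Results output above.
tokens = [
    ("Upson", "null"), ("MA", "i-r"), ("(", "i-r"), ("2019", "i-r"), (")", "i-r"),
    (".", "o"), ("This", "o"), ("is", "o"), ("a", "o"), ("reference", "o"), (".", "o"),
    ("In", "i-r"), ("a", "i-r"), ("journal", "i-r"), (".", "o"),
    ("16(1", "o"), (")", "o"), ("1", "o"), ("-", "o"), ("23", "o"),
]


def reference_spans(pairs, inside_labels=("b-r", "i-r")):
    """Join each run of consecutive 'inside a reference' tokens into one span.

    The default inside_labels are an assumption, not taken from the library.
    """
    spans = []
    for is_inside, group in groupby(pairs, key=lambda p: p[1] in inside_labels):
        if is_inside:
            spans.append(" ".join(tok for tok, _ in group))
    return spans


spans = reference_spans(tokens)
covered = sum(1 for _, label in tokens if label in ("b-r", "i-r"))

print(spans)
print(f"{covered} of {len(tokens)} tokens labelled as inside a reference")
```

Under these assumptions, the single input reference comes back as two short fragments and most of its tokens fall outside any reference, which is the behaviour reported here.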
This model was expected to perform less well than the model implemented in 2020.3.1; however, it performs worse than anticipated.
The new model 2020.3.6 is required to ensure compatibility with the changes implemented in #25. Changes to the Rodrigues data format mean that this model runs in less than one hour, instead of around 16 hours.
Some experimentation with hyper-parameters is probably all that is needed to bring this model up to scratch, and in any case it is largely superseded by the multitask split_parse model. If a high-quality splitting model is required immediately, revert to one of the earlier pre-release versions for now; all of them perform very well.