Jing Fan*, Dennis Aumiller*, and Michael Gertz
Institute of Computer Science, Heidelberg University
* These authors contributed equally to this work.
You can reach us via the GitHub issues, or send us an email at [email protected]!
2023-05-23: A pre-print of our work is now available on arXiv.
2023-05-15: Our work has been accepted at *SEM 2023! We will update the citation once the proceedings become available.
We provide an exhaustive list of required packages in the requirements.txt file.
However, given the finicky dependency issues surrounding the (nowadays deprecated) AllenNLP release, as well as the spaCy versions required,
we strongly suggest creating a new environment in which to install this package.
You can install the required core dependencies with
python3 -m pip install -r requirements.txt
This is guaranteed to work for Python versions 3.8 and 3.9; we do not guarantee full compatibility with 3.10.
Furthermore, we encountered some (temporary?) issues regarding the dependency on typing-extensions==4.6.0 and pydantic, respectively.
More information can be found in this GitHub issue.
Should you encounter a similar problem, consider manually downgrading your typing-extensions version to typing-extensions==4.5.0.
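To check whether an environment is affected before running anything, a minimal sketch along the following lines may help. It relies only on the two version numbers named above; the helper names is_affected_version and report_typing_extensions are hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Version reported as problematic above, and the suggested fallback.
AFFECTED_VERSION = "4.6.0"
SUGGESTED_VERSION = "4.5.0"


def is_affected_version(installed: str) -> bool:
    """Return True if the version string matches the problematic release."""
    return installed.strip() == AFFECTED_VERSION


def report_typing_extensions() -> str:
    """Describe the state of the locally installed typing-extensions package."""
    try:
        installed = version("typing-extensions")
    except PackageNotFoundError:
        return "typing-extensions is not installed in this environment."
    if is_affected_version(installed):
        return (
            f"typing-extensions=={installed} found; consider downgrading to "
            f"typing-extensions=={SUGGESTED_VERSION}."
        )
    return f"typing-extensions=={installed} found; no downgrade suggested."


if __name__ == "__main__":
    print(report_typing_extensions())
```

The check is purely informational: it does not modify the environment, so the downgrade itself still has to be done through pip.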
The general usage of our metric SRLScore is as follows:
from SRLScore import SRLScore
# Default values are reasonable for most cases
scorer = SRLScore()
scorer.score(input_text, summary_text)
See also the example_usage.py file. Note that SRLScore relies heavily on annotations generated by a (neural) SRL tagger.
As a result, processing should be significantly faster if you have a GPU available.
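For scoring several input/summary pairs, one possible pattern is to create the scorer once and reuse it, on the assumption that constructing it is the expensive step because of the tagger. The sketch below is illustrative only: score_pairs is a hypothetical helper, not part of this package, and it assumes nothing about the scorer beyond the score(input_text, summary_text) call shown above. The type of the returned scores is left open.

```python
from typing import Any, Iterable, List, Tuple


def score_pairs(scorer: Any, pairs: Iterable[Tuple[str, str]]) -> List[Any]:
    """Score each (input_text, summary_text) pair with the same scorer object.

    `scorer` is any object exposing score(input_text, summary_text),
    for example an SRLScore instance created once by the caller.
    """
    results = []
    for input_text, summary_text in pairs:
        # Mirrors the documented call: scorer.score(input_text, summary_text)
        results.append(scorer.score(input_text, summary_text))
    return results
```

With the package installed, this would be called as score_pairs(SRLScore(), pairs), where pairs is a list of (input_text, summary_text) tuples.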
To repeat the experiments that we performed, you may run the eval.sh script in this folder.
We further experimented with leave-one-argument-out variants of our weights, which are documented in eval_leave_out-exp.sh.
Scripts to reproduce the baseline scores (particularly for BARTScore and CoCo, the two most competitive methods with implementations available) can be found in baselines/.
For CoCo, you may further need to clone the respective paper's code repository, copy our coco_commands.sh script into their main folder, and run it from there.
significance_testing.py will re-compute the significance of differences between various methods.
Note that we apply Bonferroni correction, which makes the significance threshold fairly small!
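To illustrate why the threshold becomes small, here is a generic sketch of the textbook Bonferroni correction. It is not taken from significance_testing.py; the function names, the default alpha of 0.05, and the example p-values are assumptions made purely for illustration.

```python
from typing import List, Sequence


def bonferroni_threshold(alpha: float, num_comparisons: int) -> float:
    """Return the per-test significance threshold alpha / m."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def significant_after_bonferroni(
    p_values: Sequence[float], alpha: float = 0.05
) -> List[bool]:
    """Flag each p-value that stays significant after the correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Made-up p-values for three hypothetical pairwise comparisons.
    example_p_values = [0.001, 0.02, 0.04]
    print(bonferroni_threshold(0.05, len(example_p_values)))
    print(significant_after_bonferroni(example_p_values))
```

With three comparisons, each test is held to 0.05 / 3, so a p-value of 0.02 that would pass an uncorrected test no longer counts as significant.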
If you found this repository helpful, please consider citing our work:
@article{fan-etal-2023-evaluating,
title={{Evaluating Factual Consistency of Texts with Semantic Role Labeling}},
author={Jing Fan and Dennis Aumiller and Michael Gertz},
journal={CoRR},
volume={abs/2305.13309},
year={2023},
eprint={2305.13309},
eprinttype={arXiv},
primaryClass={cs.CL}
}