- Presentation
- Setup
- Data
- Configuration files
- Scripts
We introduce an end-to-end pipeline for retrieving, processing, and extracting targeted information from legal cases. This repository contains the code presented in our paper accepted for publication at ACL Findings 2023. We perform information extraction based on state-of-the-art neural named-entity recognition (NER). We test several architectures, including two transformer models (RoBERTa and LegalBERT), use contextual and non-contextual embeddings, and compare general-purpose versus domain-specific pre-training. The workflow is explained in more detail in the two project YAML files: project_case_cover_NER.yml and project_maintext_NER.yml.
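As a rough sketch of what the extraction step produces, the snippet below runs a spaCy pipeline over a sentence and collects the recognized entities. Since the trained models are distributed separately, it uses a blank pipeline with an `entity_ruler` and a couple of illustrative patterns as a stand-in; the labels and patterns here are assumptions for demonstration, not the paper's label set.

```python
import spacy

# Blank English pipeline; the real pipeline uses a trained NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns only -- the repository's gazetteers live in `patterns`.
ruler.add_patterns([
    {"label": "TRIBUNAL", "pattern": [{"LOWER": "immigration"}, {"LOWER": "and"},
                                      {"LOWER": "refugee"}, {"LOWER": "board"}]},
    {"label": "GPE", "pattern": "Toronto"},
])

doc = nlp("The Immigration and Refugee Board heard the claim in Toronto.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

Swapping `spacy.blank("en")` for a trained model loaded with `spacy.load(...)` gives the same `(text, label)` output format.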
The NER models are trained with spaCy's EntityRecognizer (see the configuration files below).
Requirements: see requirements.txt.
CanLII provides a dataset of 59,112 refugee cases from the Immigration and Refugee Board of Canada. The data is available online in HTML and can be downloaded as PDF. The raw data is not published in this GitHub repository; it is available upon request.
| Terminology/Gazetteer | |
| --- | --- |
| patterns | terminology per label of the main text |
| Static vectors | |
| --- | --- |
| static_vectors | static vectors |
The script prodigy_explanation.sh contains explanations and instructions for collecting annotated samples with the Prodigy semi-automatic annotation tool.
| Configuration files: case cover | |
| --- | --- |
| baseline.cfg | Baseline CNN |
| CNN+random_static.cfg | CNN with random static vectors |
| config_trf.cfg | RoBERTa-based transformer |
| config_trf_legalbert.cfg | LegalBERT-based transformer |
| Configuration files: main text (1st set of categories, pretrained) | |
| --- | --- |
| baseline.cfg | Baseline CNN |
| CNN+random_static.cfg | CNN with random static vectors |
| CNN+static_vectors.cfg | CNN with fine-tuned static vectors |
| CNN+pretraining+random.cfg | CNN with pretraining and random static vectors |
| CNN+pretraining+myvectors.cfg | CNN with pretraining and fine-tuned static vectors |
| config_trf.cfg | RoBERTa-based transformer |
| config_trf_legalbert.cfg | LegalBERT-based transformer |
| Configuration files: main text (2nd set of categories, created from scratch) | |
| --- | --- |
| baseline.cfg | Baseline CNN |
| CNN+random_static.cfg | CNN with random static vectors |
| CNN+static_vectors.cfg | CNN with fine-tuned static vectors |
| CNN+pretraining+random.cfg | CNN with pretraining and random static vectors |
| CNN+pretraining+myvectors.cfg | CNN with pretraining and fine-tuned static vectors |
| config_trf.cfg | RoBERTa-based transformer |
| config_trf_legalbert.cfg | LegalBERT-based transformer |
The bash script run.sh shows how to train the models from the configuration files with spaCy.
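Training can also be launched from Python via spaCy's documented training API. The config and corpus paths below are placeholders, not paths from this repository; adjust them to your local layout.

```python
from pathlib import Path
from spacy.cli.train import train

# Hypothetical paths -- substitute one of the .cfg files listed above
# and your converted .spacy corpora.
config = Path("configs/config_trf.cfg")
overrides = {
    "paths.train": "corpus/train.spacy",
    "paths.dev": "corpus/dev.spacy",
}

if config.exists():  # only start training when the config is actually present
    train(config, output_path=Path("training/"), overrides=overrides)
```

This is equivalent to `python -m spacy train config.cfg --output training/ --paths.train ... --paths.dev ...` on the command line, which is what run.sh wraps.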
Trained models are available at https://huggingface.co/clairebarale/refugee_cases_ner. Models can be packaged with the `spacy package` command.
preprocessing_scripts holds the scripts needed to curate the training data. preprocess.py converts the annotations from JSONL to spaCy's binary format.
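The conversion performed by preprocess.py can be sketched as follows. The sample record and its field names (`text`, `spans`) follow Prodigy's JSONL export format; this is an assumed example, so adjust the keys if your annotation files differ.

```python
import json

import spacy
from spacy.tokens import DocBin

# One annotated record in Prodigy-style JSONL (illustrative example).
sample = ('{"text": "The claimant arrived in Toronto in 2015.", '
          '"spans": [{"start": 24, "end": 31, "label": "GPE"}]}')

nlp = spacy.blank("en")
db = DocBin()

record = json.loads(sample)
doc = nlp.make_doc(record["text"])
ents = []
for span in record["spans"]:
    ent = doc.char_span(span["start"], span["end"], label=span["label"])
    if ent is not None:  # drop spans that do not align with token boundaries
        ents.append(ent)
doc.ents = ents
db.add(doc)
# db.to_disk("train.spacy")  # the .spacy file consumed by `spacy train`
```

Misaligned spans are silently dropped here; the repository's script may handle them differently.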
utils contains helper scripts to curate the text of the documents, count the number of annotations collected, and separate labels from each other.