- Presentation
- Setup
- Data
- Configuration files
- Scripts
We introduce an end-to-end pipeline for retrieving, processing, and extracting targeted information from legal cases. This repository contains the code presented in our paper accepted for publication at ACL Findings 2023. We perform information extraction based on state-of-the-art neural named-entity recognition (NER). We test several architectures, including two transformer models (RoBERTa and LegalBERT), use contextual and non-contextual embeddings, and compare general-purpose versus domain-specific pre-training. The workflow is explained in more detail in the two project YAML files: project_case_cover_NER.yml and project_maintext_NER.yml.
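As a rough sketch of what the extraction step produces, the snippet below runs a spaCy pipeline over a sentence and collects the recognized entities. Since the trained models are distributed separately, it uses a blank pipeline with an `entity_ruler` and a couple of illustrative patterns as a stand-in; the labels and patterns here are assumptions for demonstration, not the paper's label set.

```python
import spacy

# Blank English pipeline; the real pipeline uses a trained NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns only -- the repository's gazetteers live in `patterns`.
ruler.add_patterns([
    {"label": "TRIBUNAL", "pattern": [{"LOWER": "immigration"}, {"LOWER": "and"},
                                      {"LOWER": "refugee"}, {"LOWER": "board"}]},
    {"label": "GPE", "pattern": "Toronto"},
])

doc = nlp("The Immigration and Refugee Board heard the claim in Toronto.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
```

Swapping `spacy.blank("en")` for a trained model loaded with `spacy.load(...)` gives the same `(text, label)` output format.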
The NER models are trained with spaCy's EntityRecognizer (see the configuration files below).
Requirements: see requirements.txt.
CanLII provides a dataset of 59,112 refugee cases from the Immigration and Refugee Board of Canada. The data is available online in HTML and can be downloaded as PDF. The raw data is not published in this GitHub repository; it is available upon request.
| Terminology/Gazetteer | |
| --- | --- |
| patterns | terminology per label of the main text |
| Static vectors | |
| --- | --- |
| static_vectors | static vectors |
The script prodigy_explanation.sh contains explanations and instructions for collecting annotated samples with the Prodigy semi-automatic annotation tool.
| Configuration files: case cover | |
| --- | --- |
| baseline.cfg | Baseline CNN |
| CNN+random_static.cfg | CNN with random static vectors |
| config_trf.cfg | RoBERTa-based transformer |
| config_trf_legalbert.cfg | LegalBERT-based transformer |
| Configuration files: main text (1st set of categories, pretrained) | |
| --- | --- |
| baseline.cfg | Baseline CNN |
| CNN+random_static.cfg | CNN with random static vectors |
| CNN+static_vectors.cfg | CNN with fine-tuned static vectors |
| CNN+pretraining+random.cfg | CNN with pretraining and random static vectors |
| CNN+pretraining+myvectors.cfg | CNN with pretraining and fine-tuned static vectors |
| config_trf.cfg | RoBERTa-based transformer |
| config_trf_legalbert.cfg | LegalBERT-based transformer |
| Configuration files: main text (2nd set of categories, created from scratch) | |
| --- | --- |
| baseline.cfg | Baseline CNN |
| CNN+random_static.cfg | CNN with random static vectors |
| CNN+static_vectors.cfg | CNN with fine-tuned static vectors |
| CNN+pretraining+random.cfg | CNN with pretraining and random static vectors |
| CNN+pretraining+myvectors.cfg | CNN with pretraining and fine-tuned static vectors |
| config_trf.cfg | RoBERTa-based transformer |
| config_trf_legalbert.cfg | LegalBERT-based transformer |
The bash script run.sh shows how to train the models from the configuration files with spaCy.
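Training can also be launched from Python via spaCy's documented training API. The config and corpus paths below are placeholders, not paths from this repository; adjust them to your local layout.

```python
from pathlib import Path
from spacy.cli.train import train

# Hypothetical paths -- substitute one of the .cfg files listed above
# and your converted .spacy corpora.
config = Path("configs/config_trf.cfg")
overrides = {
    "paths.train": "corpus/train.spacy",
    "paths.dev": "corpus/dev.spacy",
}

if config.exists():  # only start training when the config is actually present
    train(config, output_path=Path("training/"), overrides=overrides)
```

This is equivalent to `python -m spacy train config.cfg --output training/ --paths.train ... --paths.dev ...` on the command line, which is what run.sh wraps.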
Trained models are available at https://huggingface.co/clairebarale/refugee_cases_ner. Models can be packaged with the `spacy package` command.
preprocessing_scripts holds the scripts needed to curate the training data. preprocess.py converts the annotations from JSONL to spaCy's binary format.
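The conversion performed by preprocess.py can be sketched as follows. The sample record and its field names (`text`, `spans`) follow Prodigy's JSONL export format; this is an assumed example, so adjust the keys if your annotation files differ.

```python
import json

import spacy
from spacy.tokens import DocBin

# One annotated record in Prodigy-style JSONL (illustrative example).
sample = ('{"text": "The claimant arrived in Toronto in 2015.", '
          '"spans": [{"start": 24, "end": 31, "label": "GPE"}]}')

nlp = spacy.blank("en")
db = DocBin()

record = json.loads(sample)
doc = nlp.make_doc(record["text"])
ents = []
for span in record["spans"]:
    ent = doc.char_span(span["start"], span["end"], label=span["label"])
    if ent is not None:  # drop spans that do not align with token boundaries
        ents.append(ent)
doc.ents = ents
db.add(doc)
# db.to_disk("train.spacy")  # the .spacy file consumed by `spacy train`
```

Misaligned spans are silently dropped here; the repository's script may handle them differently.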
utils contains helper scripts to curate the text of the documents, count the number of annotations collected, and separate labels from each other.