Skill Extraction: benchmarks

This is the official repository for three skill extraction datasets (TECH, HOUSE and TECHWOLF), introduced in the two papers listed in the Cite section.

Dataset description

The TECH and HOUSE subsets form an extension of the SkillSpan [1] dataset, in which spans of skill mentions in sentences have been labeled with corresponding ESCO [2] skills.

The TECHWOLF subset, although smaller, represents a more generic distribution of job descriptions and skill spans. ESCO skills are directly annotated on the full sentence level, thus omitting the intermediate span identification step.

The ESCO skills in the dataset are referenced by their preferred label, as defined in ESCO version 1.1.0.
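The difference between the two annotation granularities can be illustrated with a small sketch. The class names, field names, example sentence and labels below are hypothetical placeholders chosen for illustration; they are not the schema or content of the released files. The sketch shows how span-level annotations (as in TECH and HOUSE) could be collapsed into a sentence-level label set (the granularity used for TECHWOLF), assuming each span carries zero or more ESCO preferred labels.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SpanAnnotation:
    """One skill mention inside a sentence (hypothetical schema)."""
    text: str                                              # surface form of the span
    esco_labels: List[str] = field(default_factory=list)   # may be empty


@dataclass
class Sentence:
    """A sentence with its span-level annotations (hypothetical schema)."""
    text: str
    spans: List[SpanAnnotation] = field(default_factory=list)


def sentence_level_labels(sentence: Sentence) -> Set[str]:
    """Collapse span-level labels into one label set for the whole sentence."""
    labels: Set[str] = set()
    for span in sentence.spans:
        labels.update(span.esco_labels)
    return labels


# Toy example, not taken from the datasets. The labels are placeholders.
example = Sentence(
    text="You are a team player who writes Python services and talks to customers.",
    spans=[
        SpanAnnotation("writes Python services", ["Python (computer programming)"]),
        SpanAnnotation("talks to customers", ["communicate with customers"]),
        SpanAnnotation("team player"),  # a span without an ESCO label
    ],
)

print(sorted(sentence_level_labels(example)))
```

Spans without an ESCO label simply contribute nothing to the sentence-level set, which is one possible way to treat them; other conventions are equally plausible.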

Dataset statistics

| | TECH val | TECH test | HOUSE val | HOUSE test | TECHWOLF test |
|---|---|---|---|---|---|
| # sentences | 470 | 1882 | 243 | 973 | 326 |
| # spans | 262 | 1024 | 191 | 786 | 588 |
| # spans with ESCO label | 152 | 644 | 131 | 532 | 588 |
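As a quick reading aid for the statistics above, the share of spans that carry an ESCO label can be derived from the last two rows. The snippet below is a minimal sketch: the counts are copied from the table, while the variable and function names are introduced here for illustration only.

```python
# (number of spans, number of spans with an ESCO label), copied from the table
SPAN_COUNTS = {
    ("TECH", "val"): (262, 152),
    ("TECH", "test"): (1024, 644),
    ("HOUSE", "val"): (191, 131),
    ("HOUSE", "test"): (786, 532),
    ("TECHWOLF", "test"): (588, 588),
}


def label_coverage(spans: int, labeled: int) -> float:
    """Fraction of spans that have an ESCO label."""
    return labeled / spans


for (subset, split), (spans, labeled) in SPAN_COUNTS.items():
    print(f"{subset:9s} {split:5s} {label_coverage(spans, labeled):.1%}")
```

For TECH and HOUSE the share lies between roughly 58% and 69%, whereas every TECHWOLF span has a label, consistent with its sentence-level annotation.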

Usage

For ease of use, the HuggingFace versions of the datasets are recommended.

The raw dataset files are also available in the data directory.

Cite

If you use the TECH or HOUSE dataset, please include the following reference:

@inproceedings{8770980,
  articleno    = {{4}},
  author       = {{Decorte, Jens-Joris and Van Hautte, Jeroen and Deleu, Johannes and Develder, Chris and Demeester, Thomas}},
  booktitle    = {{Proceedings of the 2nd Workshop on Recommender Systems for Human Resources (RecSys-in-HR 2022)}},
  editor       = {{Kaya, Mesut and Bogers, Toine and Graus, David and Mesbah, Sepideh and Johnson, Chris and Gutiérrez, Francisco}},
  isbn         = {{9781450398565}},
  issn         = {{1613-0073}},
  language     = {{eng}},
  location     = {{Seattle, USA}},
  pages        = {{7}},
  publisher    = {{CEUR}},
  title        = {{Design of negative sampling strategies for distantly supervised skill extraction}},
  url          = {{https://ceur-ws.org/Vol-3218/RecSysHR2022-paper_4.pdf}},
  volume       = {{3218}},
  year         = {{2022}},
}

If you use the TECHWOLF dataset, please include the following reference:

@misc{decorte2023extrememultilabelskillextraction,
      title={Extreme Multi-Label Skill Extraction Training using Large Language Models}, 
      author={Jens-Joris Decorte and Severine Verlinden and Jeroen Van Hautte and Johannes Deleu and Chris Develder and Thomas Demeester},
      year={2023},
      eprint={2307.10778},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2307.10778}, 
}

Reference

[1] Zhang, Mike, et al. "SkillSpan: Hard and soft skill extraction from English job postings." arXiv preprint arXiv:2204.12811 (2022).

[2] https://esco.ec.europa.eu/en/classification/skill_main
