NameIt

NameIt is a software tool that renames research articles in pdf files in a standardised way.

Based on the pdf metadata and on the content of the document's first page, it renames the file with author, year, title, publication, and publisher. It can extract structured data from PDFs using: (1) pdf metadata, (2) the CrossRef API, and (3) state-of-the-art layout understanding AI/ML models.

Extract structured data from PDFs using state-of-the-art layout understanding models.

In-and-out

INPUT:

One or several pdf files (e.g., journal articles)

OUTPUT:

The same pdf files are renamed in a standardised way - author, year, title, publication (e.g., journal name), and publisher.

Example: "s13174-015-0028-2.pdf" as downloaded from the publisher would become "Teixeira et al. (2015). Lessons learned from applying social network analysis on an industrial Free/Libre/Open Source Software ecosystem. Journal of Internet Services and Applications. Springer.pdf" as renamed by the NameIt tool.
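
The naming pattern in the example above can be sketched in a few lines of Python. This is only an illustration, not NameIt's actual code: the function name `build_filename` is hypothetical, and the rules for one or two authors are assumptions, since the example only shows the "et al." case.

```python
def build_filename(authors, year, title, publication, publisher):
    """Assemble 'Author (Year). Title. Publication. Publisher.pdf'.

    `authors` is a list of family names. The handling of one or two
    authors is an assumption; only "et al." appears in the README example.
    """
    if not authors:
        author_part = "Unknown"
    elif len(authors) == 1:
        author_part = authors[0]
    elif len(authors) == 2:
        author_part = f"{authors[0]} and {authors[1]}"
    else:
        author_part = f"{authors[0]} et al."

    parts = [f"{author_part} ({year})", title, publication, publisher]
    # Drop empty fields and trailing periods so the join never yields "..".
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return ". ".join(cleaned) + ".pdf"
```

Calling `build_filename(["Teixeira", "Robles", "González-Barahona"], 2015, title, journal, "Springer")` follows the same shape as the renamed file shown above.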

Mission

MISSION 1 - Enable researchers across the world to store and find research articles on their computers, servers or shared folders in an easier and faster way.
MISSION 2 - Advance the standardisation on how research articles in pdf format are named.

Benefits of using NameIt

Research articles are easier to get.
Research articles are stored in a standardized way.

Advantages of retrieving a well named pdf file from a hard drive over the publisher's website

Faster to find it
Faster to open it
Faster to print it (consider sustainability and ecological issues before printing)
Less duplication of digital resources (note that you are saving space on your hard disk and saving traffic on the web)
No need to login
No need to connect to VPN
No need to resolve some DOI
No ads
No paywalls
No overload with related or non-related information
You can open the pdf directly, without first going through the HTML web version
No cookies tracking your behaviour online
No hidden fingerprinting codes or watermarking schemes that identify the buyers for every copy of each PDF sold

Advantages of adopting the NameIt standard for naming research articles in pdf

Easier to understand what is in a file, as its filename encodes information on author, year, title, publication (e.g., journal name), and publisher.
Easier to link pdf files to entries in software tools supporting systematic literature reviews (see https://aut.ac.nz.libguides.com/systematic_reviews/tools for more information).
Easier to link pdf files to reference management systems (see https://www.helsinki.fi/en/helsinki-university-library/using-collections/courses-and-workshops/reference-management-software for more information).
Better interoperability between software tools that deal with research articles in pdf.
Predictability (i.e., you will not be surprised by how a pdf file is named after downloading it from a publisher's website or a colleague's website).
Easier exchange of collections of research articles between researchers, research groups, libraries and publishers.

TARGET USERS:

Human users - Researchers or research teams that want to store and exchange files with a standard naming convention.
Software users - Software tools supporting the management of citations, bibliometrics, references, and literature reviews now have a tool and a standard way to name files of research articles in PDF format.

Requirements

Python 3 - tested on 3.10 and later
Habanero
PyMuPDF
Transformers
PyTorch

We support the ext3, ext4, xfs, zfs, NTFS, APFS, HFS+ and exFAT filesystems. Support for FAT32 is pending. NameIt should work on most modern computers running Linux, macOS and Windows. Problems could arise with old USB flash drives and SD cards on old Linux kernels.
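
Because the target filesystems differ in which characters and lengths they accept, a generated name usually needs cleaning before the rename. The sketch below is an assumption about how that could be done, not a description of what NameIt does: `sanitize_filename`, the replacement character, and the 255-byte budget (the per-name limit on ext4, used here as a conservative default) are all illustrative choices.

```python
import re
import unicodedata

# Characters rejected by Windows/NTFS, plus control characters. "/" is also
# the path separator on Linux and macOS, so it can never be part of a name.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
MAX_BYTES = 255  # per-name limit on ext4; assumed as a safe default elsewhere


def sanitize_filename(name, replacement="-", suffix=".pdf"):
    """Return a version of `name` that common filesystems should accept."""
    stem = name[:-len(suffix)] if name.lower().endswith(suffix) else name
    # Compose accented characters so the same name has one byte sequence.
    stem = unicodedata.normalize("NFC", stem)
    stem = FORBIDDEN.sub(replacement, stem)
    # Trim whole characters until the UTF-8 encoding fits the byte budget.
    budget = MAX_BYTES - len(suffix.encode("utf-8"))
    while len(stem.encode("utf-8")) > budget:
        stem = stem[:-1]
    # Windows does not allow names that end in a space or a dot.
    stem = stem.rstrip(" .")
    return stem + suffix
```

For instance, a title containing "Free/Libre/Open Source" would come out with the slashes replaced, since a slash cannot appear inside a filename on Linux or macOS.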

Dependencies

NameIt is pure python3 code. It depends on several public packages published on the Python Package Index (PyPI). These dependencies can be installed using pip/pip3, the package installer for Python.

Habanero

$ pip3 install habanero <- required to call the Crossref API (it returns the article's metadata for a given DOI identifier).
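
As a rough illustration of the Crossref lookup, the sketch below queries one DOI with Habanero's `Crossref().works(ids=...)` call and reduces the returned record to the five fields used in the filename. The helper names `fetch_crossref_message` and `parse_crossref_message` are hypothetical and not taken from the NameIt sources; the field names (`author`, `issued`, `title`, `container-title`, `publisher`) are those of the Crossref works record.

```python
def fetch_crossref_message(doi):
    """Query Crossref for one DOI. Needs network access and habanero."""
    from habanero import Crossref  # lazy import: the parser below works offline
    return Crossref().works(ids=doi)["message"]


def _first(values):
    """Crossref stores titles as lists; return the first entry or ''."""
    return values[0] if values else ""


def parse_crossref_message(message):
    """Reduce a Crossref works record to the fields used for naming."""
    # Personal authors carry 'family'; organisations carry 'name' instead.
    authors = [a.get("family") or a.get("name", "") for a in message.get("author", [])]
    date_parts = message.get("issued", {}).get("date-parts", [])
    year = date_parts[0][0] if date_parts and date_parts[0] else None
    return {
        "authors": [a for a in authors if a],
        "year": year,
        "title": _first(message.get("title", [])),
        "publication": _first(message.get("container-title", [])),
        "publisher": message.get("publisher", ""),
    }
```

Keeping the network call separate from the parsing makes the parsing step easy to test without an Internet connection.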

PyMuPDF

$ pip3 install PyMuPDF <- required to process the first page of a pdf article.

HuggingFace Transformers, PyTorch

$ pip3 install transformers torch <- required to use the LayoutLM AI/ML model

How to use it

Clone/Fork the repository or download its source code
Install the Habanero, PyMuPDF, Transformers, PyTorch and other dependencies: $ pip3 install -r requirements.txt
Invoke the Python script and pass the file to be renamed as an argument.
Last tested on Mac and Linux with Python 3.10, 3.11, 3.12 and 3.13
You can also pass a folder as an argument and NameIt will attempt to rename all pdf files in that folder.

Example:

$ NameIt 4242343.pdf

$ NameIt research-articles-collection

$ NameIt --help
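
Accepting either a single file or a folder, as in the examples above, could look roughly like the following. This is a sketch under assumptions: `collect_pdfs` is a hypothetical helper, and whether NameIt descends into subfolders is not stated here, so this version only looks at the top level of the folder.

```python
from pathlib import Path


def collect_pdfs(argument):
    """Return the pdf files named by a command-line argument.

    A folder yields its pdf files (top level only, an assumption);
    a single pdf file yields itself; anything else yields nothing.
    """
    path = Path(argument)
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    if path.is_file() and path.suffix.lower() == ".pdf":
        return [path]
    return []
```

Comparing the suffix in lowercase means that files ending in ".PDF" are picked up as well.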

How it works

By default, the tool first tries to rename the file from the pdf's metadata. If no metadata is found, it looks on the article's first page for a DOI (Digital Object Identifier) and then tries to connect to the Internet to retrieve the metadata associated with that DOI.
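
The order described above (embedded metadata first, then a DOI lookup) can be pictured as a chain of strategies tried in turn. The sketch below is illustrative only: `find_doi`, `first_metadata` and the regular expression are assumptions, not NameIt's code, and the pattern is a loose heuristic rather than a full DOI grammar.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix that
# runs until whitespace, a quote or an angle bracket.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(first_page_text):
    """Return the first DOI-like string in the text, or None."""
    match = DOI_PATTERN.search(first_page_text)
    if match is None:
        return None
    # Sentence punctuation right after a DOI is rarely part of it.
    return match.group(0).rstrip(".,;)")


def first_metadata(pdf_path, strategies):
    """Try each strategy in order and return the first non-empty result.

    `strategies` is a list of callables taking the pdf path, for example
    one reading embedded metadata and one doing a DOI lookup (hypothetical).
    """
    for strategy in strategies:
        result = strategy(pdf_path)
        if result:
            return result
    return None
```

Passing the strategies as a list keeps the fallback order explicit and makes it easy to append a further step for the case where both of these fail.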

If the pdf's metadata is not found and the retrieval of metadata via the DOI fails, the tool still tries to find the author, year, title, publication, and publisher from the article's first page. It uses the size of the text fonts to distinguish the title from the author information and so on, relying on the LayoutLMv3 pre-trained multimodal Transformer AI/ML model.
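
To give an intuition for the font-size cue, the sketch below reads the text spans of the first page with PyMuPDF and guesses that the largest font holds the title. This is a deliberately simple stand-in and not the LayoutLMv3 pipeline that NameIt uses; the function names and the 0.1 pt tolerance are assumptions.

```python
def first_page_spans(pdf_path):
    """Return (font_size, text) pairs for the first page of a pdf."""
    import fitz  # PyMuPDF, imported lazily so guess_title works without it

    spans = []
    with fitz.open(pdf_path) as doc:
        page = doc[0]
        for block in page.get_text("dict")["blocks"]:
            # Image blocks have no "lines" entry, hence the default.
            for line in block.get("lines", []):
                for span in line["spans"]:
                    text = span["text"].strip()
                    if text:
                        spans.append((span["size"], text))
    return spans


def guess_title(spans, tolerance=0.1):
    """Join the spans set in the largest font; a naive title heuristic."""
    if not spans:
        return None
    largest = max(size for size, _ in spans)
    return " ".join(text for size, text in spans if largest - size < tolerance)
```

A heuristic like this breaks on pages where a journal logo or banner uses the largest font, which is one reason a layout-aware model is a better fit for the task.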

In a test with more than 100 journal articles downloaded directly from publishers' websites, we could rename 98 without issues. Author names with uncommon accents and article titles with uncommon characters can be problematic.

Future features

A GUI version for less technical users is forthcoming <funding needed - funding being applied for - new contributors welcome>.
Support for wildcards (e.g., NameIt folder1/*pdf Elon*.pdf)

License

MIT license. Please acknowledge derivative works.

Acknowledgements

First created by Jose Teixeira jose.teixeira@abo.fi
First contributions by Sukrit
Support from the Academy of Finland via the DiWIL project <see https://web.abo.fi/projekt/diwil/>
Support from the open-science initiatives of Åbo Akademi
