arXive-crawler

A crawler for retrieving LaTeX paper data from arXiv. This tool aims to follow the rules that arXiv provides at https://arxiv.org/help/bulk_data.

Bulk Metadata Access

OAI-PMH

arXiv supports the OAI protocol for metadata harvesting (OAI-PMH), which is described at https://www.openarchives.org/pmh/.

The standard for this protocol is described at https://www.openarchives.org/OAI/openarchivesprotocol.html.
A Python module that implements the standard is available on PyPI: https://pypi.org/project/pyoai/

-o /path/to/out meta --from START_DATE --until END_DATE

By default, only papers in the computer science domain are crawled. This behavior can be changed in metaCrawler.py.
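
As a rough illustration of what a metadata harvest request looks like under the OAI-PMH standard, the sketch below builds a ListRecords URL for a date range and reads the resumption token from a response page, using only the Python standard library. The base URL, the `cs` set name, and both helper names are assumptions made for this example; this is not the code in metaCrawler.py, which may rely on pyoai instead.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed endpoint; check arXiv's OAI-PMH help page for the current one.
BASE_URL = "https://export.arxiv.org/oai2"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_list_records_url(start_date, end_date, set_spec="cs", base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request URL for a date range (YYYY-MM-DD)."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",
        "from": start_date,
        "until": end_date,
        "set": set_spec,  # assumed set name for computer science
    }
    return base_url + "?" + urlencode(params)


def extract_resumption_token(xml_text):
    """Return the resumptionToken of a response page, or None on the last page."""
    root = ET.fromstring(xml_text)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


url = build_list_records_url("2020-01-01", "2020-01-31")
print(url)
```

A harvester would fetch this URL, process the records, and keep requesting with the returned resumption token until `extract_resumption_token` yields `None`.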

LaTeX Data Crawling

This crawler tries to abide by the rules declared at https://arxiv.org/help/bulk_data. To that end, the parameters

burstSize: 4
sleepLength: 1

can be adapted manually to change the download rate. All files are saved as either .pdf or .tar.gz, depending on the availability of source material.

Disclaimer: Even though we used the rate settings suggested by arXiv, we were sometimes blocked. Consider using a more conservative crawling speed.
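
To make the interplay of the two parameters concrete, here is a minimal sketch of burst-based throttling: it issues `burst_size` requests back to back and then pauses for `sleep_length`. The function name, the injected `fetch` and `sleep` callables, and the reading of `sleepLength` as seconds are assumptions for illustration, not the logic of dataCrawler.py.

```python
import time


def crawl_in_bursts(paper_ids, fetch, burst_size=4, sleep_length=1.0, sleep=time.sleep):
    """Call fetch(paper_id) for every id, pausing after each full burst."""
    results = []
    for index, paper_id in enumerate(paper_ids, start=1):
        results.append(fetch(paper_id))
        # Pause after every burst_size requests, but not after the final one.
        if index % burst_size == 0 and index < len(paper_ids):
            sleep(sleep_length)
    return results


# Demo with a stand-in fetch function and a recording sleep, so nothing is downloaded.
pauses = []
fetched = crawl_in_bursts(
    [f"id-{n}" for n in range(10)],
    fetch=lambda paper_id: paper_id.upper(),
    sleep=pauses.append,
)
print(len(fetched), pauses)
```

Raising `sleep_length` or lowering `burst_size` gives the more conservative crawling speed mentioned in the disclaimer.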

-o /path/to/out download --sleep 4 --burst --start START_ID --end END_ID [--proxy --diff]

Because we were blocked, this tool supports the use of proxy servers, enabled by supplying the --proxy flag. When enabled, a list of proxies is taken from https://free-proxy-list.net/. This option is very slow and is therefore not recommended.

If you need to restart the crawler often, the --diff flag is helpful. When it is supplied, the crawler checks which papers already exist in the output directory and skips them.
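
The skip logic can be pictured roughly as follows. This sketch assumes that downloaded files are named after the paper id plus the extension (for example `2001.00001.pdf`); that naming scheme and the helper names are hypothetical, and the crawler's own implementation may differ.

```python
from pathlib import Path
import tempfile

KNOWN_SUFFIXES = (".tar.gz", ".pdf")  # download types named in this README


def existing_paper_ids(output_dir):
    """Collect paper ids from file names such as '2001.00001.pdf'."""
    ids = set()
    for path in Path(output_dir).iterdir():
        for suffix in KNOWN_SUFFIXES:
            if path.is_file() and path.name.endswith(suffix):
                ids.add(path.name[: -len(suffix)])
                break
    return ids


def ids_still_missing(wanted_ids, output_dir):
    """Return the wanted ids that have no file in output_dir yet, keeping order."""
    done = existing_paper_ids(output_dir)
    return [paper_id for paper_id in wanted_ids if paper_id not in done]


# Demo in a temporary directory with two papers already "downloaded".
with tempfile.TemporaryDirectory() as out_dir:
    (Path(out_dir) / "2001.00001.pdf").write_bytes(b"")
    (Path(out_dir) / "2001.00002.tar.gz").write_bytes(b"")
    missing = ids_still_missing(["2001.00001", "2001.00002", "2001.00003"], out_dir)
    print(missing)
```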

-o /path/to/out zip --name FILE_NAME

Zips the existing paper ids from the output directory into a single file. The ids in this file are also recognised by the --diff option of the crawler.

LaTeX Expanding

This crawler can automatically expand LaTeX files using the Perl script latexpand from https://gitlab.com/latexpand/latexpand/-/blob/master/latexpand. latexpand is licensed under the BSD License https://opensource.org/licenses/BSD-3-Clause.

-o /path/to/out expand -i /path/to/in [--bib]

When --bib is supplied, the bibliography is also extracted from the LaTeX archives.
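
For orientation, the sketch below shows one way a latexpand call could be assembled from Python. It picks the main file by looking for `\documentclass` and passes it to the script with latexpand's `--output` option. The helper names, the main-file heuristic, and the script path `latexpand.pl` in the working directory are assumptions for this example; expander.py may work differently.

```python
import subprocess
import tempfile
from pathlib import Path


def find_main_tex(source_dir):
    """Return the first .tex file that contains \\documentclass, or None."""
    for path in sorted(Path(source_dir).rglob("*.tex")):
        if "\\documentclass" in path.read_text(errors="ignore"):
            return path
    return None


def build_latexpand_command(main_tex, output_tex, script="latexpand.pl"):
    """Assemble the argument list for running latexpand through perl."""
    return ["perl", str(script), "--output", str(output_tex), str(main_tex)]


def run_latexpand(main_tex, output_tex):
    """Run latexpand; check=True raises CalledProcessError if the call fails."""
    subprocess.run(build_latexpand_command(main_tex, output_tex), check=True)


# Demo: locate the main file in a small fake source tree and show the command.
with tempfile.TemporaryDirectory() as src:
    (Path(src) / "intro.tex").write_text("Some section text.\n")
    (Path(src) / "paper.tex").write_text("\\documentclass{article}\n\\input{intro}\n")
    main_tex = find_main_tex(src)
    command = build_latexpand_command(main_tex.name, "expanded.tex")
    print(command)
```

The demo only prints the command; `run_latexpand` is what would actually invoke perl, and it needs both perl and the script to be present.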

Known Issues

The crawler only recognises the file types .pdf, .tar.gz and .docx. In very few cases the downloaded file can also be a single .tex file.
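
One possible way to tell these cases apart is to inspect the downloaded bytes instead of trusting a file extension. The sketch below is an illustration only and is not part of the crawler: the function name and the returned labels are hypothetical, and it assumes a lone source file arrives gzip-compressed.

```python
import gzip
import io
import tarfile


def detect_download_type(data):
    """Classify raw bytes as 'pdf', 'tar.gz', 'gz' (single file) or 'unknown'."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"\x1f\x8b"):  # gzip magic number
        try:
            with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
                archive.getmembers()
            return "tar.gz"
        except tarfile.TarError:
            # Valid gzip but not a tar archive: likely one compressed file.
            return "gz"
        except EOFError:
            return "unknown"  # truncated gzip stream
    return "unknown"


def _make_tar_gz(name, payload):
    """Build a small .tar.gz archive in memory for the demo."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


samples = {
    "pdf": b"%PDF-1.5 ...",
    "tar.gz": _make_tar_gz("main.tex", b"\\documentclass{article}"),
    "gz": gzip.compress(b"\\documentclass{article}"),
}
detected = {label: detect_download_type(data) for label, data in samples.items()}
print(detected)
```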
