Description
Problem:
The QPA example code currently fetches `cpd-1a.prn` by running
curl -O https://www.iucr.org/__data/iucr/powder/QARR/col/cpd-1a.prn
but iucr.org is protected by Cloudflare, which returns an interstitial HTML page (“Just a moment… Enable JavaScript and cookies to continue”) instead of the raw `.prn` data.
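To illustrate the failure mode, here is a minimal sketch of a check that flags a download which is actually the HTML interstitial rather than data. The function name `looks_like_html_interstitial` is hypothetical, and the heuristic (an HTML prologue or the "Just a moment" text near the start of the payload) is an assumption based on the page text quoted above, not a documented Cloudflare contract.

```python
def looks_like_html_interstitial(payload: bytes) -> bool:
    """Heuristically detect an HTML challenge page served in place of data.

    Only the first 2 KiB are inspected; the markers are assumptions taken
    from the interstitial text quoted in this issue.
    """
    head = payload[:2048].lstrip().lower()
    starts_like_html = head.startswith((b"<!doctype html", b"<html"))
    has_challenge_text = b"just a moment" in head
    return starts_like_html or has_challenge_text
```

A check like this would only make the example fail loudly; it does not make the download work.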
Proposed Solution:
Either
- use a Python package (for example cloudscraper) to bypass Cloudflare's anti-bot page, or
- include the `.prn` file directly in the source.
I'd prefer the second option. Scraping or crawling the IUCr website with an extra dependency doesn't sound like a good idea, and Cloudflare will keep changing and hardening its protection page, so any bypass would be fragile.
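If the file is shipped with the source, the example could resolve it locally instead of downloading it. The following is only a sketch: the helper name `bundled_data_path` and the assumption that the file sits in the same directory as the example script are hypothetical, not the project's actual layout.

```python
from pathlib import Path


def bundled_data_path(example_dir: Path, filename: str = "cpd-1a.prn") -> Path:
    """Return the path of a data file shipped alongside the example code.

    `example_dir` is assumed to be the directory that contains the bundled
    file; a clear error is raised if the file is missing.
    """
    path = Path(example_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"bundled data file not found: {path}")
    return path
```

In an example script this could be called as `bundled_data_path(Path(__file__).parent)`, which removes the network dependency entirely.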
I'm not sure whether redistributing scientific data from the IUCr raises any legal issue, but the IUCr’s policy states that:
Copyright protection is not extended to files of scientific data (e.g. structural data CIFs, structure factors, primary diffraction images), and such data sets may be used freely for bona fide research purposes within the scientific community so long as proper attribution is given to the source from which they were obtained.
so proper attribution to the source would likely be needed if the data file is distributed with the package.