Handle "clusters" on paper extraction #85

Open
bzz opened this issue Sep 26, 2022 · 0 comments


bzz commented Sep 26, 2022

When extracting publications (papers) from emails, a class of papers whose links in the email look like

  • https://scholar.google.com/scholar?cluster=14905208172666766997&hl=en&oi=scholaralrt&hist=KBiQzPUAAAAJ:3103465405719670724:AAGBfm3tO_7Uk2dTXZseJcyJq0Kjaug97Q&html=&folt=rel

are skipped (14 papers out of 2k+), because we currently use a regex to extract the PDF URL from such links and it fails to match.
Instead of the usual /scholar_url?url=<url-to-the.pdf> pattern, these links look like /scholar?cluster=14905208172666766997&..., and there is no obvious way to get the URL of an individual PDF (any from the cluster).

One option is to keep those links as-is, so the user has to choose the PDF from the Scholar page themselves.
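
A minimal sketch of that option, in Python purely for illustration (it is not the project's actual code, and the function name `classify_scholar_link` and the `"direct"` / `"cluster"` / `"unknown"` labels are hypothetical). It parses the query string instead of matching a regex, returns the target URL for `/scholar_url?url=...` links, and passes `/scholar?cluster=...` links through unchanged:

```python
from typing import Tuple
from urllib.parse import urlparse, parse_qs


def classify_scholar_link(link: str) -> Tuple[str, str]:
    """Return (kind, url) for a link found in an alert email.

    Hypothetical helper: the names and the kind labels are assumptions
    made for this sketch, not part of the project.
    """
    parsed = urlparse(link)
    query = parse_qs(parsed.query)

    # Usual case: /scholar_url?url=<url-to-the.pdf>
    if parsed.path == "/scholar_url" and "url" in query:
        return ("direct", query["url"][0])

    # Cluster case: /scholar?cluster=<id>&...
    # No individual PDF URL is available, so keep the link as-is.
    if parsed.path == "/scholar" and "cluster" in query:
        return ("cluster", link)

    # Anything else is also kept, so no paper is silently dropped.
    return ("unknown", link)
```

With something like this, the 14 cluster papers would still show up in the output, only with a link to the Scholar page rather than to a PDF.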
