
Coagulation of several maintenance and bug fixes #568


Open · wants to merge 45 commits into base: main
Changes from all commits (45 commits)
4df5aad
Update the proxy keys in _get_webdriver routines
arunkannawadi Jun 17, 2023
cd260d6
Stop prepending proxy with http if it is socks
arunkannawadi Jun 17, 2023
d6d03f2
Add a pub_date field to the bib dictionary
arunkannawadi Jun 18, 2023
9813d39
Update tags to get public access of publications
arunkannawadi Jun 18, 2023
63cd33a
Account for one version of mandate be cached in tests
arunkannawadi Jun 18, 2023
ecbc213
Decrease the coauthor count to make the test pass
arunkannawadi Jun 18, 2023
2d09680
Add url entry to bibtex
arunkannawadi Jun 18, 2023
fe98eb1
Fix bibtex unittest
arunkannawadi Jun 18, 2023
6caabb2
Update CITATION version to 1.7.11
arunkannawadi Jun 17, 2023
91bada0
proxy format conflict
ma-ji Jul 11, 2023
03f063e
del debug print
ma-ji Jul 11, 2023
7f86ed4
Merge pull request #507 from ma-ji/develop
arunkannawadi Jul 13, 2023
c7d4737
Update requirements.txt
melroy89 Aug 10, 2023
89c2ebd
Merge pull request #511 from melroy89/melroy89-patch-1-1
arunkannawadi Aug 11, 2023
6ddccd7
Update citations by year data
DLu Oct 24, 2023
9f19452
Results from running codespell
DLu Oct 24, 2023
91d1235
Merge pull request #520 from DLu/spell_check
arunkannawadi Nov 5, 2023
7a4da4b
Remove 2023 values from test_cites_per_year
arunkannawadi Nov 5, 2023
c6b579d
Merge pull request #519 from DLu/fix_citation_by_year_test
arunkannawadi Nov 5, 2023
2af460e
Fixed test_bibtex unit test, updated CONTRIBUTING.md
dlebedinsky Nov 20, 2023
ba3b8a4
Added test for FreeProxy
dlebedinsky Nov 29, 2023
3b5a2e8
Fixed an issue where search_pubs doesn't find a publication when only…
keko24 Jun 24, 2024
0db2bef
Fixed total_results returning 0 when only a single publication exists.
keko24 Jun 24, 2024
2cd59b3
Removed the string in search_pubs in test_search_empty_publication.
keko24 Jun 25, 2024
568d4ad
Merge pull request #542 from keko24/main
arunkannawadi Jul 3, 2024
0765945
Update publication_parser.py
NisoD Sep 15, 2024
71e4ccf
Merge pull request #525 from dlebedinsky/develop
arunkannawadi Apr 27, 2025
0324b91
Add github action to codespell develop on push and PRs
yarikoptic Oct 28, 2024
c8bf964
Add rudimentary codespell config
yarikoptic Oct 28, 2024
3e5ae31
adjust skips
yarikoptic Oct 28, 2024
16b5f89
[DATALAD RUNCMD] run codespell throughout fixing few left typos autom…
yarikoptic Oct 28, 2024
25bbad6
Merge pull request #555 from yarikoptic/enh-codespell
arunkannawadi Apr 27, 2025
a4e6c8d
docs(quickstart): add conda to install option from github README
nkxxll Dec 29, 2024
602446d
Merge pull request #560 from nkxxll/install_from_conda
arunkannawadi Apr 27, 2025
1b065ee
The current httpx doesn't support proxies arguments:
brokenjade3000 Feb 8, 2025
8a0e780
Merge pull request #564 from brokenjade3000/patch-1
arunkannawadi Apr 28, 2025
d5e27a2
Merge branch 'develop' into patch-1
arunkannawadi Apr 28, 2025
c19c99b
Merge pull request #550 from NisoD/patch-1
arunkannawadi Apr 28, 2025
67dab6f
Update publication_parser.py
tZimmermann98 Feb 3, 2025
db06043
fallback to regex year extraction or empty String when arrow fails
tZimmermann98 Feb 3, 2025
483b338
Merge pull request #563 from tZimmermann98/develop
arunkannawadi Apr 28, 2025
eecaae5
Add in PDF link in publication fill
Luen Apr 11, 2024
63f3592
Update tests and add in pdf url from search results
Luen Apr 12, 2024
35f97d7
Renamed "pdf_url" to "eprint_url"
Luen Dec 12, 2024
ae36174
Merge pull request #536 from Luen/main
arunkannawadi Apr 28, 2025
13 changes: 7 additions & 6 deletions .github/CONTRIBUTING.md
@@ -16,12 +16,13 @@ Additionally, if you are interesting in contributing to the codebase, submit a p

## How to contribute

1. Create a fork of `scholarly-python-package/scholarly` repository.
2. If you add a new feature, try to include tests in already existing test cases, or create a new test case if that is not possible.
3. Make sure the unit tests pass before raising a PR. For all the unit tests to pass, you typically need to setup a premium proxy service such as `ScraperAPI` or `Luminati` (`Bright Data`). If you do not have an account, you may try to use `FreeProxy`. Without a proxy, 6 out of 17 test cases will be skipped.
4. Check that the documentatation is consistent with the code. Check that the documentation builds successfully.
5. Submit a PR, with `develop` as your base branch.
6. After an initial code review by the maintainers, the unit tests will be run with the `ScraperAPI` key stored in the Github repository. Passing all tests cases is necessary before merging your PR.
1. Create a fork of `scholarly-python-package/scholarly` repository. Make sure that "Copy the main branch only" is **not** checked off.
2. After cloning your fork and checking out into the develop branch, run `python setup.py --help-commands` for more info on how to install dependencies and build. You may need to run it with `sudo`.
3. If you add a new feature, try to include tests in already existing test cases, or create a new test case if that is not possible. For a comprehensive output, run `python -m unittest -v test_module.py`
4. Make sure the unit tests pass before raising a PR. For all the unit tests to pass, you typically need to setup a premium proxy service such as `ScraperAPI` or `Luminati` (`Bright Data`). By default, `python setup.py install` will get `FreeProxy`. Without a proxy, 6 out of 17 test cases will be skipped.
5. Check that the documentatation is consistent with the code. Check that the documentation builds successfully.
6. Submit a PR, with `develop` as your base branch.
7. After an initial code review by the maintainers, the unit tests will be run with the `ScraperAPI` key stored in the Github repository. Passing all tests cases is necessary before merging your PR.


## Build Docs
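As a point of reference for step 4 above, a proxy is usually configured through scholarly's `ProxyGenerator` before the tests are run. The sketch below is illustrative only and not part of this PR: the ScraperAPI key is a placeholder, and it assumes the setup methods return a success flag, as `SingleProxy` does in the diff further down.

```python
# Minimal sketch, assuming a ScraperAPI key is available; "YOUR_SCRAPERAPI_KEY"
# is a placeholder. Without a premium proxy, FreeProxies() is the free fallback
# and several test cases will be skipped.
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.ScraperAPI("YOUR_SCRAPERAPI_KEY")  # returns whether the proxy was set up
if not success:
    pg.FreeProxies()
scholarly.use_proxy(pg)

# Then run the suite verbosely:
#   python -m unittest -v test_module.py
```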
25 changes: 25 additions & 0 deletions .github/workflows/codespell.yml
@@ -0,0 +1,25 @@
# Codespell configuration is within pyproject.toml
---
name: Codespell

on:
push:
branches: [develop]
pull_request:
branches: [develop]

permissions:
contents: read

jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v4
- name: Annotate locations with typos
uses: codespell-project/codespell-problem-matcher@v1
- name: Codespell
uses: codespell-project/actions-codespell@v2
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -8,7 +8,7 @@
### Bugfixes
- Fix pprint failures on Windows #413.
- Thoroughly handle 1000 or more publications that are available (or not) according to public access mandates #414.
- Fix errors in `download_mandates_csv` that may occassionally occur for agencies without a policy link #413.
- Fix errors in `download_mandates_csv` that may occasionally occur for agencies without a policy link #413.

## Changes in v1.6.3

@@ -35,7 +35,7 @@

### Features
- Download table of funding agencies as a CSV file with URL to the funding mandates included
- Downlad top-ranking journals in general, under sub-categories and in different languages as a CSV file
- Download top-ranking journals in general, under sub-categories and in different languages as a CSV file

### Bugfixes
- #392
@@ -58,7 +58,7 @@
## Changes in v1.5.0
### Features
- Fetch the public access mandates information from a Scholar profile and mark the publications whether or not they satisfy the open-access mandate.
- Fetch an author's organization identifer from their Scholar profile
- Fetch an author's organization identifier from their Scholar profile
- Search for all authors affiliated with an organization
- Fetch homepage URL from a Scholar profile
### Enhancements
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -52,4 +52,4 @@ keywords:
citation-index scholarly-articles
citation-analysis scholar googlescholar
license: Unlicense
version: 1.5.0
version: 1.7.11
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -8,7 +8,7 @@ permalink: /coc.html
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
identity and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ This means your code that uses an earlier version of `scholarly` is guaranteed t

## Tests

To check if your installation is succesful, run the tests by executing the `test_module.py` file as:
To check if your installation is successful, run the tests by executing the `test_module.py` file as:

```bash
python3 test_module
6 changes: 6 additions & 0 deletions docs/quickstart.rst
@@ -16,6 +16,12 @@ or use ``pip`` to install from github:

pip install git+https://github.com/scholarly-python-package/scholarly.git

or use ``conda`` to install from ``conda-forge``:

.. code:: bash

conda install -c conda-forge scholarly

or clone the package using git:

.. code:: bash
7 changes: 7 additions & 0 deletions pyproject.toml
@@ -1,3 +1,10 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[tool.codespell]
# Ref: https://github.com/codespell-project/codespell#using-a-config-file
skip = '.git*'
check-hidden = true
ignore-regex = '\b(assertIn|Ewha Womans|citeseerx.ist.psu.edu\S*)\b'
# ignore-words-list = ''
2 changes: 1 addition & 1 deletion requirements.txt
@@ -2,7 +2,7 @@ arrow
beautifulsoup4
bibtexparser
deprecated
fake_useragent
fake-useragent
free-proxy
httpx
python-dotenv
30 changes: 18 additions & 12 deletions scholarly/_proxy_generator.py
@@ -109,15 +109,15 @@ def SingleProxy(self, http=None, https=None):

:param http: http proxy address
:type http: string
:param https: https proxy adress
:param https: https proxy address
:type https: string
:returns: whether or not the proxy was set up successfully
:rtype: {bool}

:Example::

>>> pg = ProxyGenerator()
>>> success = pg.SingleProxy(http = <http proxy adress>, https = <https proxy adress>)
>>> success = pg.SingleProxy(http = <http proxy address>, https = <https proxy address>)
"""
self.logger.info("Enabling proxies: http=%s https=%s", http, https)
proxy_works = self._use_proxy(http=http, https=https)
@@ -136,7 +136,8 @@ def _check_proxy(self, proxies) -> bool:
:rtype: {bool}
"""
with requests.Session() as session:
session.proxies = proxies
# Reformat proxy for requests. Requests and HTTPX use different proxy format.
session.proxies = {'http':proxies['http://'], 'https':proxies['https://']}
try:
resp = session.get("http://httpbin.org/ip", timeout=self._TIMEOUT)
if resp.status_code == 200:
@@ -161,7 +162,7 @@ def _check_proxy(self, proxies) -> bool:
def _refresh_tor_id(self, tor_control_port: int, password: str) -> bool:
"""Refreshes the id by using a new Tor node.

:returns: Whether or not the refresh was succesful
:returns: Whether or not the refresh was successful
:rtype: {bool}
"""
try:
@@ -189,11 +190,12 @@ def _use_proxy(self, http: str, https: str = None) -> bool:
:returns: whether or not the proxy was set up successfully
:rtype: {bool}
"""
if http[:4] != "http":
# Reformat proxy for HTTPX
if http[:4] not in ("http", "sock"):
http = "http://" + http
if https is None:
https = http
elif https[:5] != "https":
elif https[:5] not in ("https", "socks"):
https = "https://" + https

proxies = {'http://': http, 'https://': https}
@@ -365,8 +367,8 @@ def _get_webdriver(self):
def _get_chrome_webdriver(self):
if self._proxy_works:
webdriver.DesiredCapabilities.CHROME['proxy'] = {
"httpProxy": self._proxies['http'],
"sslProxy": self._proxies['https'],
"httpProxy": self._proxies['http://'],
"sslProxy": self._proxies['https://'],
"proxyType": "MANUAL"
}

@@ -381,8 +383,8 @@ def _get_firefox_webdriver(self):
if self._proxy_works:
# Redirect webdriver through proxy
webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
"httpProxy": self._proxies['http'],
"sslProxy": self._proxies['https'],
"httpProxy": self._proxies['http://'],
"sslProxy": self._proxies['https://'],
"proxyType": "MANUAL",
}

@@ -432,7 +434,7 @@ def _handle_captcha2(self, url):
self.logger.info("Google thinks we are DOSing the captcha.")
raise e
except (WebDriverException) as e:
self.logger.info("Browser seems to be disfunctional - closed by user?")
self.logger.info("Browser seems to be dysfunctional - closed by user?")
raise e
except Exception as e:
# TODO: This exception handler should eventually be removed when
@@ -483,6 +485,10 @@ def _new_session(self, **kwargs):
# ScraperAPI requests to work.
# https://www.scraperapi.com/documentation/
init_kwargs["verify"] = False
if 'proxies' in init_kwargs:
proxy=init_kwargs['proxies']['https://']
del init_kwargs['proxies']
init_kwargs['proxy'] = proxy
self._session = httpx.Client(**init_kwargs)
self._webdriver = None

@@ -498,7 +504,7 @@ def _close_session(self):
self.logger.warning("Could not close webdriver cleanly: %s", e)

def _fp_coroutine(self, timeout=1, wait_time=120):
"""A coroutine to continuosly yield free proxies
"""A coroutine to continuously yield free proxies

It takes back the proxies that stopped working and marks it as dirty.
"""
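A side note on the proxy-format changes in this file: the internal `_proxies` dictionary follows httpx's scheme-prefixed keys, `requests` expects plain scheme names, and recent httpx releases accept a single `proxy` argument instead of `proxies`. The standalone sketch below is not code from this PR; it only restates that mapping with an example SOCKS proxy URL.

```python
# Illustration only (not part of the module): the two proxy-dictionary
# conventions reconciled by the changes above.
httpx_style = {
    "http://": "socks5://127.0.0.1:9050",   # httpx keys proxies by URL prefix
    "https://": "socks5://127.0.0.1:9050",
}

# Equivalent mapping for a requests.Session, which keys by scheme name:
requests_style = {
    "http": httpx_style["http://"],
    "https": httpx_style["https://"],
}
print(requests_style)

# Newer httpx clients take a single proxy URL rather than a `proxies` dict:
#   client = httpx.Client(proxy=httpx_style["https://"])
```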
2 changes: 1 addition & 1 deletion scholarly/_scholarly.py
@@ -428,7 +428,7 @@ def search_pubs_custom_url(self, url: str)->_SearchScholarIterator:
parameters in the Advanced Search dialog box and then use the URL here
to programmatically fetch the results.

:param url: custom url to seach for the publication
:param url: custom url to search for the publication
:type url: string
"""
return self.__nav.search_publications(url)
8 changes: 4 additions & 4 deletions scholarly/author_parser.py
@@ -152,14 +152,14 @@ def _fill_public_access(self, soup, author):
while True:
rows = soup.find_all('div', 'gsc_mnd_sec_na')
if rows:
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gs_nph gsc_mnd_link_font'):
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gsc_mnd_link_font'):
author_pub_id = re.findall(r"citation_for_view=([\w:-]*)",
row['data-href'])[0]
publications[author_pub_id]["public_access"] = False

rows = soup.find_all('div', 'gsc_mnd_sec_avl')
if rows:
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gs_nph gsc_mnd_link_font'):
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gsc_mnd_link_font'):
author_pub_id = re.findall(r"citation_for_view=([\w:-]*)",
row['data-href'])[0]
publications[author_pub_id]["public_access"] = True
@@ -222,7 +222,7 @@ def _get_coauthors_short(self, soup):
def _get_coauthors_long(self, author):
"""Get the long (>20) list of coauthors.

This method fetches the complete list of coauthors bu opening a new
This method fetches the complete list of coauthors by opening a new
page filled with the complete coauthor list.

Note:
@@ -283,7 +283,7 @@ def fill(self, author, sections: list = [], sortby="citedby", publication_limit:
:type sortby: string
:param publication_limit: Select the max number of publications you want you want to fill for the author. Defaults to no limit.
:type publication_limit: int
:returns: The filled object if fill was successfull, False otherwise.
:returns: The filled object if fill was successful, False otherwise.
:rtype: Author or bool

:Example::
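For context on the selector change above: when `find_all` is given a multi-word class string, BeautifulSoup matches it against the exact value of the `class` attribute, so the stale `gs_nph` token made the old selector match nothing once Scholar dropped that class. A small illustration with made-up HTML:

```python
# Illustration with made-up HTML: a multi-word class string must match the
# class attribute exactly, so the outdated selector returns no rows.
from bs4 import BeautifulSoup

html = '<a class="gsc_mnd_art_rvw gsc_mnd_link_font" data-href="#">mandate row</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("a", "gsc_mnd_art_rvw gs_nph gsc_mnd_link_font"))  # [] - stale class list
print(soup.find_all("a", "gsc_mnd_art_rvw gsc_mnd_link_font"))         # [<a ...>mandate row</a>]
```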
6 changes: 3 additions & 3 deletions scholarly/data_types.py
@@ -20,7 +20,7 @@ class PublicationSource(str, Enum):

"PUBLICATION SEARCH SNIPPET".
This form captures the publication when it appears as a "snippet" in
the context of the resuls of a publication search. For example:
the context of the results of a publication search. For example:

Publication search: https://scholar.google.com/scholar?hl=en&q=adaptive+fraud+detection&btnG=&as_sdt=0%2C33

@@ -49,7 +49,7 @@ class PublicationSource(str, Enum):
We also have publications that appear in the "author pages" of Google Scholar.
These publications are often a set of publications "merged" together.

The snippet version of these publications conains the title of the publication,
The snippet version of these publications contains the title of the publication,
a subset of the authors, the (sometimes truncated) venue, and the year of the publication
and the number of papers that cite the publication.

@@ -183,7 +183,7 @@ class Publication(TypedDict, total=False):
the "citedby_url" will be a comma-separated list of values.
It is also used to return the "cluster" of all the different versions of the paper.
https://scholar.google.com/scholar?cluster=16766804411681372720&hl=en
:param cites_per_year: a dictionay containing the number of citations per year for this Publication
:param cites_per_year: a dictionary containing the number of citations per year for this Publication
(source: AUTHOR_PUBLICATION_ENTRY)
:param eprint_url: digital version of the Publication. Usually it is a pdf.
:param pub_url: url of the website providing the publication
25 changes: 22 additions & 3 deletions scholarly/publication_parser.py
@@ -58,7 +58,7 @@ def _load_url(self, url: str):
# this is temporary until setup json file
self._soup = self._nav._get_soup(url)
self._pos = 0
self._rows = self._soup.find_all('div', class_='gs_r gs_or gs_scl') + self._soup.find_all('div', class_='gsc_mpat_ttl')
self._rows = self._soup.select("div.gs_r.gs_or.gs_scl") + self._soup.select("div.gs_r.gs_or.gs_scl.gs_fmar") + self._soup.select("div.gsc_mpat_ttl")

def _get_total_results(self):
if self._soup.find("div", class_="gs_pda"):
Expand All @@ -70,7 +70,7 @@ def _get_total_results(self):
match = re.match(pattern=r'(^|\s*About)\s*([0-9,\.\s’]+)', string=x.text)
if match:
return int(re.sub(pattern=r'[,\.\s’]',repl='', string=match.group(2)))
return 0
return len(self._rows)

# Iterator protocol

@@ -202,6 +202,10 @@ def _scholar_pub(self, __data, publication: Publication):
if title.find('a'):
publication['pub_url'] = title.find('a')['href']

pdf_div = __data.find('div', class_='gs_ggs gs_fl')
if pdf_div and pdf_div.find('a', href=True):
publication['eprint_url'] = pdf_div.find('a')['href']

author_div_element = databox.find('div', class_='gs_a')
authorinfo = author_div_element.text
authorinfo = authorinfo.replace(u'\xa0', u' ') # NBSP
@@ -286,6 +290,10 @@ def fill(self, publication: Publication)->Publication:
if soup.find('a', class_='gsc_oci_title_link'):
publication['pub_url'] = soup.find(
'a', class_='gsc_oci_title_link')['href']
if soup.find('div', class_='gsc_oci_title_ggi'):
link = soup.find('a', attrs={'data-clk': True})
if link:
publication['eprint_url'] = link['href']
for item in soup.find_all('div', class_='gs_scl'):
key = item.find(class_='gsc_oci_field').text.strip().lower()
val = item.find(class_='gsc_oci_value')
Expand All @@ -312,7 +320,13 @@ def fill(self, publication: Publication)->Publication:
'YYYY/M/DD',
'YYYY/M/D',
'YYYY/MM/D']
publication['bib']['pub_year'] = arrow.get(val.text, patterns).year
try:
publication['bib']['pub_year'] = arrow.get(val.text, patterns).year
except ValueError:
# fallback to regex year extraction if arrow fails
match = re.search(r'\d{4}', val.text)
publication['bib']['pub_year'] = match.group() if match else ""
publication['bib']['pub_date'] = val.text
elif key == 'description':
# try to find all the gsh_csp if they exist
abstract = val.find_all(class_='gsh_csp')
@@ -401,6 +415,11 @@ def bibtex(self, publication: Publication) -> str:
publication = self.fill(publication)
a = BibDatabase()
converted_dict = publication['bib']
try:
url = publication['eprint_url']
except KeyError:
url = publication.get('pub_url', '')
converted_dict['url'] = url
converted_dict = remap_bib(converted_dict, _BIB_REVERSE_MAPPING)
str_dict = {key: str(value) for key, value in converted_dict.items()}
# convert every key of the dictionary to string to be Bibtex compatible
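To restate the date-handling change in `fill()` above as a standalone sketch: arrow is tried first with the same format list as the diff, and a four-digit regex match (or an empty string) is used when arrow fails. The helper name below is hypothetical, not a function in `publication_parser.py`.

```python
# Standalone sketch of the pub_year fallback added above; parse_pub_year is a
# made-up helper for illustration only.
import re
import arrow

def parse_pub_year(text: str):
    patterns = ['YYYY/M', 'YYYY/MM/DD', 'YYYY', 'YYYY/M/DD', 'YYYY/M/D', 'YYYY/MM/D']
    try:
        return arrow.get(text, patterns).year   # int when arrow can parse the date
    except ValueError:                          # arrow's ParserError subclasses ValueError
        match = re.search(r'\d{4}', text)       # fall back to the first 4-digit run
        return match.group() if match else ""   # '' when no year is present

print(parse_pub_year("2021/5/17"))  # 2021
print(parse_pub_year("n.d."))       # ''
```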