
Coagulation of several maintenance and bug fixes #568


Open · wants to merge 45 commits into base: main
Changes from all commits (45 commits)
4df5aad
Update the proxy keys in _get_webdriver routines
arunkannawadi Jun 17, 2023
cd260d6
Stop prepending proxy with http if it is socks
arunkannawadi Jun 17, 2023
d6d03f2
Add a pub_date field to the bib dictionary
arunkannawadi Jun 18, 2023
9813d39
Update tags to get public access of publications
arunkannawadi Jun 18, 2023
63cd33a
Account for one version of mandate be cached in tests
arunkannawadi Jun 18, 2023
ecbc213
Decrease the coauthor count to make the test pass
arunkannawadi Jun 18, 2023
2d09680
Add url entry to bibtex
arunkannawadi Jun 18, 2023
fe98eb1
Fix bibtex unittest
arunkannawadi Jun 18, 2023
6caabb2
Update CITATION version to 1.7.11
arunkannawadi Jun 17, 2023
91bada0
proxy format conflict
ma-ji Jul 11, 2023
03f063e
del debug print
ma-ji Jul 11, 2023
7f86ed4
Merge pull request #507 from ma-ji/develop
arunkannawadi Jul 13, 2023
c7d4737
Update requirements.txt
melroy89 Aug 10, 2023
89c2ebd
Merge pull request #511 from melroy89/melroy89-patch-1-1
arunkannawadi Aug 11, 2023
6ddccd7
Update citations by year data
DLu Oct 24, 2023
9f19452
Results from running codespell
DLu Oct 24, 2023
91d1235
Merge pull request #520 from DLu/spell_check
arunkannawadi Nov 5, 2023
7a4da4b
Remove 2023 values from test_cites_per_year
arunkannawadi Nov 5, 2023
c6b579d
Merge pull request #519 from DLu/fix_citation_by_year_test
arunkannawadi Nov 5, 2023
2af460e
Fixed test_bibtex unit test, updated CONTRIBUTING.md
dlebedinsky Nov 20, 2023
ba3b8a4
Added test for FreeProxy
dlebedinsky Nov 29, 2023
3b5a2e8
Fixed an issue where search_pubs doesn't find a publication when only…
keko24 Jun 24, 2024
0db2bef
Fixed total_results returning 0 when only a single publication exists.
keko24 Jun 24, 2024
2cd59b3
Removed the string in search_pubs in test_search_empty_publication.
keko24 Jun 25, 2024
568d4ad
Merge pull request #542 from keko24/main
arunkannawadi Jul 3, 2024
0765945
Update publication_parser.py
NisoD Sep 15, 2024
71e4ccf
Merge pull request #525 from dlebedinsky/develop
arunkannawadi Apr 27, 2025
0324b91
Add github action to codespell develop on push and PRs
yarikoptic Oct 28, 2024
c8bf964
Add rudimentary codespell config
yarikoptic Oct 28, 2024
3e5ae31
adjust skips
yarikoptic Oct 28, 2024
16b5f89
[DATALAD RUNCMD] run codespell throughout fixing few left typos autom…
yarikoptic Oct 28, 2024
25bbad6
Merge pull request #555 from yarikoptic/enh-codespell
arunkannawadi Apr 27, 2025
a4e6c8d
docs(quickstart): add conda to install option from github README
nkxxll Dec 29, 2024
602446d
Merge pull request #560 from nkxxll/install_from_conda
arunkannawadi Apr 27, 2025
1b065ee
The current httpx doesn't support proxies arguments:
brokenjade3000 Feb 8, 2025
8a0e780
Merge pull request #564 from brokenjade3000/patch-1
arunkannawadi Apr 28, 2025
d5e27a2
Merge branch 'develop' into patch-1
arunkannawadi Apr 28, 2025
c19c99b
Merge pull request #550 from NisoD/patch-1
arunkannawadi Apr 28, 2025
67dab6f
Update publication_parser.py
tZimmermann98 Feb 3, 2025
db06043
fallback to regex year extraction or empty String when arrow fails
tZimmermann98 Feb 3, 2025
483b338
Merge pull request #563 from tZimmermann98/develop
arunkannawadi Apr 28, 2025
eecaae5
Add in PDF link in publication fill
Luen Apr 11, 2024
63f3592
Update tests and add in pdf url from search results
Luen Apr 12, 2024
35f97d7
Renamed "pdf_url" to "eprint_url"
Luen Dec 12, 2024
ae36174
Merge pull request #536 from Luen/main
arunkannawadi Apr 28, 2025
13 changes: 7 additions & 6 deletions .github/CONTRIBUTING.md
@@ -16,12 +16,13 @@ Additionally, if you are interesting in contributing to the codebase, submit a p

## How to contribute

1. Create a fork of `scholarly-python-package/scholarly` repository.
2. If you add a new feature, try to include tests in already existing test cases, or create a new test case if that is not possible.
3. Make sure the unit tests pass before raising a PR. For all the unit tests to pass, you typically need to setup a premium proxy service such as `ScraperAPI` or `Luminati` (`Bright Data`). If you do not have an account, you may try to use `FreeProxy`. Without a proxy, 6 out of 17 test cases will be skipped.
4. Check that the documentatation is consistent with the code. Check that the documentation builds successfully.
5. Submit a PR, with `develop` as your base branch.
6. After an initial code review by the maintainers, the unit tests will be run with the `ScraperAPI` key stored in the Github repository. Passing all tests cases is necessary before merging your PR.
1. Create a fork of `scholarly-python-package/scholarly` repository. Make sure that "Copy the main branch only" is **not** checked off.
2. After cloning your fork and checking out into the develop branch, run `python setup.py --help-commands` for more info on how to install dependencies and build. You may need to run it with `sudo`.
3. If you add a new feature, try to include tests in already existing test cases, or create a new test case if that is not possible. For a comprehensive output, run `python -m unittest -v test_module.py`
4. Make sure the unit tests pass before raising a PR. For all the unit tests to pass, you typically need to setup a premium proxy service such as `ScraperAPI` or `Luminati` (`Bright Data`). By default, `python setup.py install` will get `FreeProxy`. Without a proxy, 6 out of 17 test cases will be skipped.
5. Check that the documentatation is consistent with the code. Check that the documentation builds successfully.
6. Submit a PR, with `develop` as your base branch.
7. After an initial code review by the maintainers, the unit tests will be run with the `ScraperAPI` key stored in the Github repository. Passing all tests cases is necessary before merging your PR.


## Build Docs
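As a point of reference for step 4 above, a proxy is usually configured through scholarly's `ProxyGenerator` before the tests are run. The sketch below is illustrative only and not part of this PR: the ScraperAPI key is a placeholder, and it assumes the setup methods return a success flag, as `SingleProxy` does in the diff further down.

```python
# Minimal sketch, assuming a ScraperAPI key is available; "YOUR_SCRAPERAPI_KEY"
# is a placeholder. Without a premium proxy, FreeProxies() is the free fallback
# and several test cases will be skipped.
from scholarly import scholarly, ProxyGenerator

pg = ProxyGenerator()
success = pg.ScraperAPI("YOUR_SCRAPERAPI_KEY")  # returns whether the proxy was set up
if not success:
    pg.FreeProxies()
scholarly.use_proxy(pg)

# Then run the suite verbosely:
#   python -m unittest -v test_module.py
```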
25 changes: 25 additions & 0 deletions .github/workflows/codespell.yml
@@ -0,0 +1,25 @@
# Codespell configuration is within pyproject.toml
---
name: Codespell

on:
push:
branches: [develop]
pull_request:
branches: [develop]

permissions:
contents: read

jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v4
- name: Annotate locations with typos
uses: codespell-project/codespell-problem-matcher@v1
- name: Codespell
uses: codespell-project/actions-codespell@v2
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -8,7 +8,7 @@
### Bugfixes
- Fix pprint failures on Windows #413.
- Thoroughly handle 1000 or more publications that are available (or not) according to public access mandates #414.
- Fix errors in `download_mandates_csv` that may occassionally occur for agencies without a policy link #413.
- Fix errors in `download_mandates_csv` that may occasionally occur for agencies without a policy link #413.

## Changes in v1.6.3

@@ -35,7 +35,7 @@

### Features
- Download table of funding agencies as a CSV file with URL to the funding mandates included
- Downlad top-ranking journals in general, under sub-categories and in different languages as a CSV file
- Download top-ranking journals in general, under sub-categories and in different languages as a CSV file

### Bugfixes
- #392
@@ -58,7 +58,7 @@
## Changes in v1.5.0
### Features
- Fetch the public access mandates information from a Scholar profile and mark the publications whether or not they satisfy the open-access mandate.
- Fetch an author's organization identifer from their Scholar profile
- Fetch an author's organization identifier from their Scholar profile
- Search for all authors affiliated with an organization
- Fetch homepage URL from a Scholar profile
### Enhancements
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -52,4 +52,4 @@ keywords:
citation-index scholarly-articles
citation-analysis scholar googlescholar
license: Unlicense
version: 1.5.0
version: 1.7.11
2 changes: 1 addition & 1 deletion CODE_OF_CONDUCT.md
@@ -8,7 +8,7 @@ permalink: /coc.html
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
identity and expression, level of experience, education, socioeconomic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ This means your code that uses an earlier version of `scholarly` is guaranteed t

## Tests

To check if your installation is succesful, run the tests by executing the `test_module.py` file as:
To check if your installation is successful, run the tests by executing the `test_module.py` file as:

```bash
python3 test_module
6 changes: 6 additions & 0 deletions docs/quickstart.rst
@@ -16,6 +16,12 @@ or use ``pip`` to install from github:

pip install git+https://github.com/scholarly-python-package/scholarly.git

or use ``conda`` to install from ``conda-forge``:

.. code:: bash

conda install -c conda-forge scholarly

or clone the package using git:

.. code:: bash
7 changes: 7 additions & 0 deletions pyproject.toml
@@ -1,3 +1,10 @@
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[tool.codespell]
# Ref: https://github.com/codespell-project/codespell#using-a-config-file
skip = '.git*'
check-hidden = true
ignore-regex = '\b(assertIn|Ewha Womans|citeseerx.ist.psu.edu\S*)\b'
# ignore-words-list = ''
2 changes: 1 addition & 1 deletion requirements.txt
@@ -2,7 +2,7 @@ arrow
beautifulsoup4
bibtexparser
deprecated
fake_useragent
fake-useragent
free-proxy
httpx
python-dotenv
30 changes: 18 additions & 12 deletions scholarly/_proxy_generator.py
@@ -109,15 +109,15 @@ def SingleProxy(self, http=None, https=None):

:param http: http proxy address
:type http: string
:param https: https proxy adress
:param https: https proxy address
:type https: string
:returns: whether or not the proxy was set up successfully
:rtype: {bool}

:Example::

>>> pg = ProxyGenerator()
>>> success = pg.SingleProxy(http = <http proxy adress>, https = <https proxy adress>)
>>> success = pg.SingleProxy(http = <http proxy address>, https = <https proxy address>)
"""
self.logger.info("Enabling proxies: http=%s https=%s", http, https)
proxy_works = self._use_proxy(http=http, https=https)
@@ -136,7 +136,8 @@ def _check_proxy(self, proxies) -> bool:
:rtype: {bool}
"""
with requests.Session() as session:
session.proxies = proxies
# Reformat proxy for requests. Requests and HTTPX use different proxy format.
session.proxies = {'http':proxies['http://'], 'https':proxies['https://']}
try:
resp = session.get("http://httpbin.org/ip", timeout=self._TIMEOUT)
if resp.status_code == 200:
@@ -161,7 +162,7 @@ def _check_proxy(self, proxies) -> bool:
def _refresh_tor_id(self, tor_control_port: int, password: str) -> bool:
"""Refreshes the id by using a new Tor node.

:returns: Whether or not the refresh was succesful
:returns: Whether or not the refresh was successful
:rtype: {bool}
"""
try:
@@ -189,11 +190,12 @@ def _use_proxy(self, http: str, https: str = None) -> bool:
:returns: whether or not the proxy was set up successfully
:rtype: {bool}
"""
if http[:4] != "http":
# Reformat proxy for HTTPX
if http[:4] not in ("http", "sock"):
http = "http://" + http
if https is None:
https = http
elif https[:5] != "https":
elif https[:5] not in ("https", "socks"):
https = "https://" + https

proxies = {'http://': http, 'https://': https}
@@ -365,8 +367,8 @@ def _get_webdriver(self):
def _get_chrome_webdriver(self):
if self._proxy_works:
webdriver.DesiredCapabilities.CHROME['proxy'] = {
"httpProxy": self._proxies['http'],
"sslProxy": self._proxies['https'],
"httpProxy": self._proxies['http://'],
"sslProxy": self._proxies['https://'],
"proxyType": "MANUAL"
}

@@ -381,8 +383,8 @@ def _get_firefox_webdriver(self):
if self._proxy_works:
# Redirect webdriver through proxy
webdriver.DesiredCapabilities.FIREFOX['proxy'] = {
"httpProxy": self._proxies['http'],
"sslProxy": self._proxies['https'],
"httpProxy": self._proxies['http://'],
"sslProxy": self._proxies['https://'],
"proxyType": "MANUAL",
}

@@ -432,7 +434,7 @@ def _handle_captcha2(self, url):
self.logger.info("Google thinks we are DOSing the captcha.")
raise e
except (WebDriverException) as e:
self.logger.info("Browser seems to be disfunctional - closed by user?")
self.logger.info("Browser seems to be dysfunctional - closed by user?")
raise e
except Exception as e:
# TODO: This exception handler should eventually be removed when
@@ -483,6 +485,10 @@ def _new_session(self, **kwargs):
# ScraperAPI requests to work.
# https://www.scraperapi.com/documentation/
init_kwargs["verify"] = False
if 'proxies' in init_kwargs:
proxy=init_kwargs['proxies']['https://']
del init_kwargs['proxies']
init_kwargs['proxy'] = proxy
self._session = httpx.Client(**init_kwargs)
self._webdriver = None

@@ -498,7 +504,7 @@ def _close_session(self):
self.logger.warning("Could not close webdriver cleanly: %s", e)

def _fp_coroutine(self, timeout=1, wait_time=120):
"""A coroutine to continuosly yield free proxies
"""A coroutine to continuously yield free proxies

It takes back the proxies that stopped working and marks it as dirty.
"""
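A side note on the proxy-format changes in this file: the internal `_proxies` dictionary follows httpx's scheme-prefixed keys, `requests` expects plain scheme names, and recent httpx releases accept a single `proxy` argument instead of `proxies`. The standalone sketch below is not code from this PR; it only restates that mapping with an example SOCKS proxy URL.

```python
# Illustration only (not part of the module): the two proxy-dictionary
# conventions reconciled by the changes above.
httpx_style = {
    "http://": "socks5://127.0.0.1:9050",   # httpx keys proxies by URL prefix
    "https://": "socks5://127.0.0.1:9050",
}

# Equivalent mapping for a requests.Session, which keys by scheme name:
requests_style = {
    "http": httpx_style["http://"],
    "https": httpx_style["https://"],
}
print(requests_style)

# Newer httpx clients take a single proxy URL rather than a `proxies` dict:
#   client = httpx.Client(proxy=httpx_style["https://"])
```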
2 changes: 1 addition & 1 deletion scholarly/_scholarly.py
@@ -428,7 +428,7 @@ def search_pubs_custom_url(self, url: str)->_SearchScholarIterator:
parameters in the Advanced Search dialog box and then use the URL here
to programmatically fetch the results.

:param url: custom url to seach for the publication
:param url: custom url to search for the publication
:type url: string
"""
return self.__nav.search_publications(url)
8 changes: 4 additions & 4 deletions scholarly/author_parser.py
@@ -152,14 +152,14 @@ def _fill_public_access(self, soup, author):
while True:
rows = soup.find_all('div', 'gsc_mnd_sec_na')
if rows:
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gs_nph gsc_mnd_link_font'):
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gsc_mnd_link_font'):
author_pub_id = re.findall(r"citation_for_view=([\w:-]*)",
row['data-href'])[0]
publications[author_pub_id]["public_access"] = False

rows = soup.find_all('div', 'gsc_mnd_sec_avl')
if rows:
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gs_nph gsc_mnd_link_font'):
for row in rows[0].find_all('a', 'gsc_mnd_art_rvw gsc_mnd_link_font'):
author_pub_id = re.findall(r"citation_for_view=([\w:-]*)",
row['data-href'])[0]
publications[author_pub_id]["public_access"] = True
@@ -222,7 +222,7 @@ def _get_coauthors_short(self, soup):
def _get_coauthors_long(self, author):
"""Get the long (>20) list of coauthors.

This method fetches the complete list of coauthors bu opening a new
This method fetches the complete list of coauthors by opening a new
page filled with the complete coauthor list.

Note:
@@ -283,7 +283,7 @@ def fill(self, author, sections: list = [], sortby="citedby", publication_limit:
:type sortby: string
:param publication_limit: Select the max number of publications you want you want to fill for the author. Defaults to no limit.
:type publication_limit: int
:returns: The filled object if fill was successfull, False otherwise.
:returns: The filled object if fill was successful, False otherwise.
:rtype: Author or bool

:Example::
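For context on the selector change above: when `find_all` is given a multi-word class string, BeautifulSoup matches it against the exact value of the `class` attribute, so the stale `gs_nph` token made the old selector match nothing once Scholar dropped that class. A small illustration with made-up HTML:

```python
# Illustration with made-up HTML: a multi-word class string must match the
# class attribute exactly, so the outdated selector returns no rows.
from bs4 import BeautifulSoup

html = '<a class="gsc_mnd_art_rvw gsc_mnd_link_font" data-href="#">mandate row</a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("a", "gsc_mnd_art_rvw gs_nph gsc_mnd_link_font"))  # [] - stale class list
print(soup.find_all("a", "gsc_mnd_art_rvw gsc_mnd_link_font"))         # [<a ...>mandate row</a>]
```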
6 changes: 3 additions & 3 deletions scholarly/data_types.py
@@ -20,7 +20,7 @@ class PublicationSource(str, Enum):

"PUBLICATION SEARCH SNIPPET".
This form captures the publication when it appears as a "snippet" in
the context of the resuls of a publication search. For example:
the context of the results of a publication search. For example:

Publication search: https://scholar.google.com/scholar?hl=en&q=adaptive+fraud+detection&btnG=&as_sdt=0%2C33

@@ -49,7 +49,7 @@ class PublicationSource(str, Enum):
We also have publications that appear in the "author pages" of Google Scholar.
These publications are often a set of publications "merged" together.

The snippet version of these publications conains the title of the publication,
The snippet version of these publications contains the title of the publication,
a subset of the authors, the (sometimes truncated) venue, and the year of the publication
and the number of papers that cite the publication.

@@ -183,7 +183,7 @@ class Publication(TypedDict, total=False):
the "citedby_url" will be a comma-separated list of values.
It is also used to return the "cluster" of all the different versions of the paper.
https://scholar.google.com/scholar?cluster=16766804411681372720&hl=en
:param cites_per_year: a dictionay containing the number of citations per year for this Publication
:param cites_per_year: a dictionary containing the number of citations per year for this Publication
(source: AUTHOR_PUBLICATION_ENTRY)
:param eprint_url: digital version of the Publication. Usually it is a pdf.
:param pub_url: url of the website providing the publication
25 changes: 22 additions & 3 deletions scholarly/publication_parser.py
@@ -58,7 +58,7 @@ def _load_url(self, url: str):
# this is temporary until setup json file
self._soup = self._nav._get_soup(url)
self._pos = 0
self._rows = self._soup.find_all('div', class_='gs_r gs_or gs_scl') + self._soup.find_all('div', class_='gsc_mpat_ttl')
self._rows = self._soup.select("div.gs_r.gs_or.gs_scl") + self._soup.select("div.gs_r.gs_or.gs_scl.gs_fmar") + self._soup.select("div.gsc_mpat_ttl")

def _get_total_results(self):
if self._soup.find("div", class_="gs_pda"):
Expand All @@ -70,7 +70,7 @@ def _get_total_results(self):
match = re.match(pattern=r'(^|\s*About)\s*([0-9,\.\s’]+)', string=x.text)
if match:
return int(re.sub(pattern=r'[,\.\s’]',repl='', string=match.group(2)))
return 0
return len(self._rows)

# Iterator protocol

@@ -202,6 +202,10 @@ def _scholar_pub(self, __data, publication: Publication):
if title.find('a'):
publication['pub_url'] = title.find('a')['href']

pdf_div = __data.find('div', class_='gs_ggs gs_fl')
if pdf_div and pdf_div.find('a', href=True):
publication['eprint_url'] = pdf_div.find('a')['href']

author_div_element = databox.find('div', class_='gs_a')
authorinfo = author_div_element.text
authorinfo = authorinfo.replace(u'\xa0', u' ') # NBSP
@@ -286,6 +290,10 @@ def fill(self, publication: Publication)->Publication:
if soup.find('a', class_='gsc_oci_title_link'):
publication['pub_url'] = soup.find(
'a', class_='gsc_oci_title_link')['href']
if soup.find('div', class_='gsc_oci_title_ggi'):
link = soup.find('a', attrs={'data-clk': True})
if link:
publication['eprint_url'] = link['href']
for item in soup.find_all('div', class_='gs_scl'):
key = item.find(class_='gsc_oci_field').text.strip().lower()
val = item.find(class_='gsc_oci_value')
Expand All @@ -312,7 +320,13 @@ def fill(self, publication: Publication)->Publication:
'YYYY/M/DD',
'YYYY/M/D',
'YYYY/MM/D']
publication['bib']['pub_year'] = arrow.get(val.text, patterns).year
try:
publication['bib']['pub_year'] = arrow.get(val.text, patterns).year
except ValueError:
# fallback to regex year extraction if arrow fails
match = re.search(r'\d{4}', val.text)
publication['bib']['pub_year'] = match.group() if match else ""
publication['bib']['pub_date'] = val.text
elif key == 'description':
# try to find all the gsh_csp if they exist
abstract = val.find_all(class_='gsh_csp')
@@ -401,6 +415,11 @@ def bibtex(self, publication: Publication) -> str:
publication = self.fill(publication)
a = BibDatabase()
converted_dict = publication['bib']
try:
url = publication['eprint_url']
except KeyError:
url = publication.get('pub_url', '')
converted_dict['url'] = url
converted_dict = remap_bib(converted_dict, _BIB_REVERSE_MAPPING)
str_dict = {key: str(value) for key, value in converted_dict.items()}
# convert every key of the dictionary to string to be Bibtex compatible
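To restate the date-handling change in `fill()` above as a standalone sketch: arrow is tried first with the same format list as the diff, and a four-digit regex match (or an empty string) is used when arrow fails. The helper name below is hypothetical, not a function in `publication_parser.py`.

```python
# Standalone sketch of the pub_year fallback added above; parse_pub_year is a
# made-up helper for illustration only.
import re
import arrow

def parse_pub_year(text: str):
    patterns = ['YYYY/M', 'YYYY/MM/DD', 'YYYY', 'YYYY/M/DD', 'YYYY/M/D', 'YYYY/MM/D']
    try:
        return arrow.get(text, patterns).year   # int when arrow can parse the date
    except ValueError:                          # arrow's ParserError subclasses ValueError
        match = re.search(r'\d{4}', text)       # fall back to the first 4-digit run
        return match.group() if match else ""   # '' when no year is present

print(parse_pub_year("2021/5/17"))  # 2021
print(parse_pub_year("n.d."))       # ''
```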