Implement `ScienceDirectSearch` using the `PUT` method #396

nils-herrmann · 2025-05-07T16:20:54Z

PR for Issue #395. To implement the PUT method, the whole search pipeline had to be extended. In summary, I made the following changes:

Extension of get_content to support the PUT method.
Extension of Base. Unfortunately, the pagination logic differs from that of ScopusSearch, which uses GET, so I had to add a new conditional clause.
Extension of the Retrieve class. There were also some incompatibilities here (the query is a nested dictionary), which required new conditional clauses.
A completely new ScienceDirectSearch class, since the returned results were in a new format.
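
To illustrate why the pagination differs, here is a minimal sketch of offset-based paging over a PUT body. It assumes paging is driven by the `display.offset` and `display.show` fields of the request schema quoted later in this thread. The helper `iter_results`, the callable `fetch_page`, and the response keys `results` and `resultsFound` are hypothetical names for illustration, not the code in this PR.

```python
def iter_results(fetch_page, body, show=100):
    """Yield all results by advancing display.offset page by page.

    fetch_page is a stand-in for whatever sends the PUT request and
    returns the decoded JSON response (hypothetical, for illustration).
    """
    offset = 0
    while True:
        # Copy the body so the caller's dictionary is not mutated.
        display = {**body.get("display", {}), "offset": offset, "show": show}
        response = fetch_page({**body, "display": display})
        results = response.get("results", [])  # assumed response key
        yield from results
        offset += len(results)
        # Stop on an empty page or once the reported total is reached.
        if not results or offset >= response.get("resultsFound", 0):
            break
```

With GET, by contrast, the paging parameters travel in the URL, which is why a shared code path needs a conditional.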

Michael-E-Rose · 2025-05-12T08:20:45Z

It's a shame that the query string changes. I'm thinking about how to minimize the necessary changes on the user end and how to stay as close to the other classes as possible.

My first question is whether the qs key in the search dict is mandatory.

nils-herrmann · 2025-05-12T08:58:30Z

Here is the request schema from the documentation:

{
    authors: string,
    date: string,
    display: {
        highlights: boolean,
        offset: integer,
        show: integer,
        sortBy: string
    },
    filters: {
        openAccess: boolean
    },
    issue: string,
    loadedAfter: string,
    page: string,
    pub: string,
    qs: string,
    title: string,
    volume: string
}

There are no mandatory fields. However, if the query has too many results, we get the error "Rate of requests exceeds specified limits. Recommend lowering request rate and/or concurrency of requests."

Michael-E-Rose · 2025-05-12T09:11:09Z

Do users use either qs or the others?

nils-herrmann · 2025-05-12T09:36:14Z

There is no restriction; users can also use both. However, combining, for example, title and qs does not make sense, since qs already queries all the fields.

From the documentation:

A free text search using the GET interface is equivalent to using qs with the PUT interface.

Michael-E-Rose · 2025-05-15T11:11:54Z

In this case I would suggest the following:
The default way of interacting with this class is the qs string, which we call query. That's the same way other classes expect input. It will also serve as the filename (in hashed form).
Then we enable kwds and args to take over some other fields.

Consistency is key, as is the requirement to use information for the cache file.

Can you implement that please?

nils-herrmann · 2025-05-18T18:01:24Z

I implemented the suggestion to pass qs via the query argument. Now we can use the class by passing a query string and keyword arguments:

sds = ScienceDirectSearch('"neural radiance fields" AND "3D rendering"', date='2024')
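
A minimal sketch of how a query string plus keyword arguments could be mapped onto the nested PUT body from the request schema above. The helper `build_put_body` and the key sets are hypothetical and only illustrate the idea; they are not the implementation in this PR.

```python
# Field names are taken from the request schema quoted earlier in the thread.
TOP_LEVEL_KEYS = {"authors", "date", "issue", "loadedAfter",
                  "page", "pub", "title", "volume"}
DISPLAY_KEYS = {"highlights", "offset", "show", "sortBy"}
FILTER_KEYS = {"openAccess"}


def build_put_body(query=None, **kwds):
    """Build the nested request body (hypothetical helper)."""
    body = {}
    if query:
        # The positional query becomes the free-text qs field.
        body["qs"] = query
    for key, value in kwds.items():
        if key in TOP_LEVEL_KEYS:
            body[key] = value
        elif key in DISPLAY_KEYS:
            body.setdefault("display", {})[key] = value
        elif key in FILTER_KEYS:
            body.setdefault("filters", {})[key] = value
        else:
            raise ValueError(f"Unknown search field: {key}")
    return body
```

For the call above, this would produce a body with only the `qs` and `date` keys set.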

Regarding the cache, we cannot use only the query argument, since all searches with an empty query would have the same filename. Consider the following example:

sds_1 = ScienceDirectSearch(title='Assessing LLMs in malicious code deobfuscation of real-world malware campaigns', date='2024')
sds_2 = ScienceDirectSearch(title='Sampling latent material-property information from LLM-derived embedding representations', date='2024')
sds_1._cache_file_path == sds_2._cache_file_path

True

Therefore, I suggest keeping the current implementation and using the complete flattened query dictionary as the name.

Michael-E-Rose · 2025-05-23T07:26:58Z

> Therefore, I suggest keeping the current implementation and using the complete flattened query dictionary as the name.

Agreed, but let's use that only when there is no query.
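
A minimal sketch of that rule: hash the query string when one is given, otherwise hash the flattened query dictionary. The helpers `flatten` and `cache_name` are hypothetical, and the choice of MD5 over a sorted JSON dump is an assumption made for illustration only.

```python
import hashlib
import json


def flatten(d, prefix=""):
    """Flatten a nested dict into dotted keys, e.g. display.show."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def cache_name(body):
    """Return a hashed cache file name for a request body (hypothetical)."""
    qs = body.get("qs")
    if qs:
        # A query string is present: use it alone, as for other classes.
        basis = qs
    else:
        # No query string: fall back to the complete flattened dictionary.
        basis = json.dumps(flatten(body), sort_keys=True)
    return hashlib.md5(basis.encode("utf8")).hexdigest()
```

Under this rule the two title-only searches from the example above get different names. Note that in this sketch two searches with the same query string but different keyword arguments would still share one name.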

pybliometrics/sciencedirect/sciencedirect_search.py

pybliometrics/sciencedirect/tests/test_ScienceDirectSearch.py

Michael-E-Rose · 2025-06-16T10:25:03Z

It will be a problem if we remove functionality from the current classes. They're in use already. And frankly I prefer more data even if it is retrieved in a non-recommended way than less data retrieved the right way.

I would put the PR on hold until ScienceDirect removes the GET method altogether.

nils-herrmann · 2025-06-17T15:48:49Z

With the GET method we only have two extra fields, both of which we could reconstruct (details below).

| Old Method (`GET`) | New Method (`PUT`) |
| --- | --- |
| authors | authors |
| first_author | N/A |
| doi | doi |
| title | title |
| link | uri |
| load_date | loadDate |
| openaccess_status | openAccess |
| pii | pii |
| coverDate | publicationDate |
| endingPage | last_page |
| publicationName | sourceTitle |
| startingPage | first_page |
| api_link | N/A |
| volume | volumeIssue |

The PUT method also returns the order of the authors:

{'authors': [{'order': 1, 'name': 'Constantinos Patsakis'},
             {'order': 2, 'name': 'Fran Casino'},
             {'order': 3, 'name': 'Nikolaos Lykousas'}]}

The api_link can be reconstructed with the pii: f"https://api.elsevier.com/content/article/pii/{pii}"

We could also keep the field names of the old method (GET).
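
A minimal sketch of how a PUT result could be mapped back to the old field names, including the two reconstructed fields. The helper `to_get_style` is hypothetical, and it assumes each result is a flat dictionary keyed as in the right-hand column of the table above.

```python
def to_get_style(record):
    """Map a PUT result onto the old GET field names (hypothetical helper)."""
    # Sort by the 'order' key so the first author can be reconstructed.
    authors = sorted(record.get("authors") or [], key=lambda a: a["order"])
    names = [a["name"] for a in authors]
    pii = record.get("pii")
    return {
        "authors": names,
        "first_author": names[0] if names else None,
        "doi": record.get("doi"),
        "title": record.get("title"),
        "link": record.get("uri"),
        "load_date": record.get("loadDate"),
        "openaccess_status": record.get("openAccess"),
        "pii": pii,
        "coverDate": record.get("publicationDate"),
        "endingPage": record.get("last_page"),
        "publicationName": record.get("sourceTitle"),
        "startingPage": record.get("first_page"),
        # Reconstructed from the pii, as described above.
        "api_link": (f"https://api.elsevier.com/content/article/pii/{pii}"
                     if pii else None),
        "volume": record.get("volumeIssue"),
    }
```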

Finally, what I find most problematic about GET is that we cannot filter the results by date and therefore get misleading counts.

Michael-E-Rose · 2025-06-18T18:19:14Z

Ok, then we can keep it. But in any case, this requires a new major version. Not soon, though.

nils-herrmann marked this pull request as draft May 7, 2025 16:21

nils-herrmann marked this pull request as ready for review May 7, 2025 16:48