Implement ScienceDirectSearch using the PUT method #396

Open
wants to merge 9 commits into
base: master

nils-herrmann
Collaborator

@nils-herrmann nils-herrmann commented May 7, 2025

PR for Issue #395. To implement the PUT method, the whole search pipeline had to be extended. In summary, the following changes were made:

  • Extension of get_content to support the PUT method.
  • Extension of Base. Unfortunately, the pagination logic differs from that of ScopusSearch, which uses GET, so I had to add a new conditional clause.
  • Extension of the Retrieve class. There were also some incompatibilities here (the query is a nested dictionary), which required new conditional clauses.
  • A completely new ScienceDirectSearch class, since the returned results were in a new format.
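
To illustrate the pagination point, here is a minimal sketch of offset-based paging. It assumes the PUT interface pages through results via the offset and show fields of display (both appear in the request schema quoted later in this thread) and that the total number of results is known from a first response. The function name `paginated_bodies` and the default page size are hypothetical, not taken from the PR.

```python
def paginated_bodies(query, total, show=25):
    """Yield one request body per page, advancing display.offset by `show`.

    `query` is the nested search dictionary, `total` the number of results
    reported by the API (assumed to be known), `show` the page size.
    """
    for offset in range(0, total, show):
        body = dict(query)  # shallow copy so the caller's dict is not modified
        # Keep any display options already set, then overwrite the paging fields.
        body["display"] = {**query.get("display", {}), "offset": offset, "show": show}
        yield body


pages = list(paginated_bodies({"qs": "graphene"}, total=60, show=25))
print([page["display"]["offset"] for page in pages])
```

With GET-based classes the next page is typically followed via a link or cursor in the response, which is why a separate branch is needed for PUT.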

@nils-herrmann nils-herrmann marked this pull request as draft May 7, 2025 16:21
@nils-herrmann nils-herrmann marked this pull request as ready for review May 7, 2025 16:48
@Michael-E-Rose
Contributor

That's a shame that the query string changes. I'm thinking about how to minimize the necessary changes on the user end and how to stay as close to the other classes as possible.

My first question is whether the qs key in the search dict is mandatory.

@nils-herrmann
Collaborator Author

Here is the request schema from the documentation:

{
    authors: string,
    date: string,
    display: {
        highlights: boolean,
        offset: integer,
        show: integer,
        sortBy: string
    },
    filters: {
        openAccess: boolean
    },
    issue: string,
    loadedAfter: string,
    page: string,
    pub: string,
    qs: string,
    title: string,
    volume: string
}

There are no mandatory fields. However, if the query has too many results, we get the error "Rate of requests exceeds specified limits. Recommend lowering request rate and/or concurrency of requests."
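
A minimal sketch of how flat keyword arguments could be sorted into this nested schema. `build_put_body` and the three field sets are hypothetical names for illustration and are not part of the PR; only the field names themselves come from the schema above.

```python
import json

# Field names taken from the request schema quoted above.
TOP_LEVEL_FIELDS = {"authors", "date", "issue", "loadedAfter", "page",
                    "pub", "qs", "title", "volume"}
DISPLAY_FIELDS = {"highlights", "offset", "show", "sortBy"}
FILTER_FIELDS = {"openAccess"}


def build_put_body(**fields):
    """Hypothetical helper: place each keyword argument at its spot in the schema."""
    body = {}
    for key, value in fields.items():
        if value is None:
            continue  # no field is mandatory, so unset ones are simply omitted
        if key in TOP_LEVEL_FIELDS:
            body[key] = value
        elif key in DISPLAY_FIELDS:
            body.setdefault("display", {})[key] = value
        elif key in FILTER_FIELDS:
            body.setdefault("filters", {})[key] = value
        else:
            raise ValueError(f"Unknown search field: {key}")
    return body


body = build_put_body(qs="graphene", date="2024", show=25, openAccess=True)
print(json.dumps(body, indent=2))
```

Rejecting unknown keys early is a design choice of this sketch; it surfaces typos locally instead of sending them to the API.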

@Michael-E-Rose
Contributor

Do users use either qs or the others?

@nils-herrmann
Collaborator Author

nils-herrmann commented May 12, 2025

There is no restriction; users can also use both. However, combining, for example, title and qs does not make sense, since qs already queries all the fields.

From the documentation:

A free text search using the GET interface is equivalent to using qs with the PUT interface.

@Michael-E-Rose
Contributor

In this case I would suggest the following:
The default way of interacting with this class is the qs string, which we call query. That's the same way other classes expect input. It will also serve as the filename (in its hashed version).
Then we enable args and kwds to take over some of the other fields.

Consistency is key, as is the requirement to use information for the cache file.

Can you implement that please?

@nils-herrmann
Collaborator Author

nils-herrmann commented May 18, 2025

I implemented the suggestion to pass qs via the query argument. Now we can use the class by passing a query string and keyword arguments:

sds = ScienceDirectSearch('"neural radiance fields" AND "3D rendering"', date='2024')

Regarding the cache, we cannot use only the query argument, since all searches with an empty query would then share the same filename. Consider the following example:

sds_1 = ScienceDirectSearch(title='Assessing LLMs in malicious code deobfuscation of real-world malware campaigns', date='2024')
sds_2 = ScienceDirectSearch(title='Sampling latent material-property information from LLM-derived embedding representations', date='2024')
sds_1._cache_file_path == sds_2._cache_file_path

True

Therefore I suggest keeping the current implementation and using the complete flattened query dictionary as the name.

@Michael-E-Rose
Contributor

> Therefore I suggest to keep the current implementation and use the complete flattened query dictionary as name.

Agreed, but let's use that only when there is no query.
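
A minimal sketch of that rule, under stated assumptions: `cache_key` is a hypothetical helper, MD5 is assumed as the hash, and a JSON dump with sorted keys stands in for the "flattened query dictionary". None of these details are confirmed by the PR.

```python
import hashlib
import json


def cache_key(query=None, **fields):
    """Hypothetical helper: build the hashed name used for the cache file."""
    if query:
        # A query string is present: hash it alone, as the other search classes do.
        basis = query
    else:
        # No query: fall back to all remaining fields so distinct searches differ.
        basis = json.dumps(fields, sort_keys=True)
    return hashlib.md5(basis.encode("utf-8")).hexdigest()


key_a = cache_key(title="Paper A", date="2024")
key_b = cache_key(title="Paper B", date="2024")
print(key_a != key_b)
```

Note that under this rule two searches with the same query but different keyword arguments share a key. Whether that is acceptable is a design decision outside this sketch.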

@nils-herrmann nils-herrmann force-pushed the science_direct_search branch from 5b09b03 to 4c5e213 Compare June 13, 2025 15:10
@Michael-E-Rose
Contributor

It will be a problem if we remove functionality from the current classes. They're in use already. And frankly I prefer more data even if it is retrieved in a non-recommended way than less data retrieved the right way.

I would put the PR on hold until ScienceDirect removes the GET method altogether.

@nils-herrmann
Collaborator Author

With the GET method we only have two extra fields which we could reconstruct (details below).

| Old Method (GET)  | New Method (PUT) |
|-------------------|------------------|
| authors           | authors          |
| first_author      | Na               |
| doi               | doi              |
| title             | title            |
| link              | uri              |
| load_date         | loadDate         |
| openaccess_status | openAccess       |
| pii               | pii              |
| coverDate         | publicationDate  |
| endingPage        | last_page        |
| publicationName   | sourceTitle      |
| startingPage      | first_page       |
| api_link          | Na               |
| volume            | volumeIssue      |

The PUT method also returns the order of the authors:

{'authors': [{'order': 1, 'name': 'Constantinos Patsakis'},
                    {'order': 2, 'name': 'Fran Casino'},
                    {'order': 3, 'name': 'Nikolaos Lykousas'}]}

The api_link can be reconstructed with the pii: f"https://api.elsevier.com/content/article/pii/{pii}"
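
A minimal sketch of reconstructing both missing fields from a PUT result. `reconstruct_get_fields` is a hypothetical helper, and it assumes that first_author corresponds to the author with order 1; the exact format the GET method used for first_author is not shown here.

```python
def reconstruct_get_fields(item):
    """Hypothetical helper: derive the two GET-only fields from a PUT result."""
    # Sort by the 'order' key rather than trusting the list order.
    authors = sorted(item.get("authors") or [], key=lambda a: a["order"])
    first_author = authors[0]["name"] if authors else None
    pii = item.get("pii")
    api_link = f"https://api.elsevier.com/content/article/pii/{pii}" if pii else None
    return {"first_author": first_author, "api_link": api_link}


item = {
    "authors": [
        {"order": 2, "name": "Fran Casino"},
        {"order": 1, "name": "Constantinos Patsakis"},
        {"order": 3, "name": "Nikolaos Lykousas"},
    ],
    "pii": "S0000000000000000",  # placeholder, not a real PII
}
fields = reconstruct_get_fields(item)
print(fields)
```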

We could also keep the field names of the old method (GET).

Finally, what I find most problematic about GET is that we cannot filter the results by date and therefore get misleading counts.

@Michael-E-Rose
Contributor

Ok, then we can keep it. But in any case, this requires a new major version. Not soon, though.
