Fix EvilAngel performerByURL -> Refactor Algolia scraping #2177
base: master
Conversation
I've done a first pass of implementing all the scrapers for EvilAngel.yml with the new AlgoliaAPI.py. There are some TODOs for handling multiple sites, and for doing some form of result score matching (e.g. galleryByFragment) when an operation finds multiple API results but can only return a single scraper result.
@Maista6969 I think a long time ago there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago... Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.
I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?
Implementing markers would be nice.
I've been wanting to rewrite Algolia for a long time, thank you for contributing this! I certainly agree that the old Algolia has become pretty crufty and I think this is an excellent start on the Road to Refactor 😁
The next step will be creating an EvilAngel.py that uses this API module, so we can have an extra layer of indirection where we can handle the special cases for this site, like the studio remappings that are currently such a mess in the old Algolia.py
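A minimal sketch of what that indirection could look like (the `scene_from_url` helper and the field names below are assumptions for illustration, not the actual interface in this PR):

```python
# Hypothetical EvilAngel.py: a thin site wrapper over the shared module.
# `scene_from_url` is an assumed helper name, not the real AlgoliaAPI.py API.
import json
import sys

from AlgoliaAPI import scene_from_url

# site-specific studio remappings live here, out of the generic module
STUDIO_MAP = {"Evil Angel Films": "Evil Angel"}

def fix_studio(scene: dict) -> dict:
    name = scene.get("studio", {}).get("name")
    if name in STUDIO_MAP:
        scene["studio"]["name"] = STUDIO_MAP[name]
    return scene

if __name__ == "__main__":
    # Stash passes the scrape fragment as JSON on stdin
    fragment = json.loads(sys.stdin.read())
    print(json.dumps(fix_studio(scene_from_url(fragment["url"]))))
```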
I did wonder what else would be needed, apart from the studio remapping stuff... I thought that subclassing or importing functions from a "base" script (similar to the Aylo implementation) might be more complex than it's worth (to give flexibility that ultimately isn't needed). With that in mind, I had a think about how the scraper configuration YAML could be used for something like the studio name mapping, and came up with a possible solution like this:

```python
import ast

# the API hit dictionary
api_hit = {
    'studio_name': 'Enid Blyton',
    'serie_name': 'Groovy Gang',
    'channel_name': 'Happy Joy',
    'sitename': 'thisthing',
    'segment': 'something',
}

# these could come in via the `args["extra"]` list of strings
conditions_and_values_to_assign = [
    "api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']",
    "api_hit['segment'] == 'something' => 'a fixed value'",
    "api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']",
]

for condition_and_value_to_assign in conditions_and_values_to_assign:
    condition, value_to_assign = condition_and_value_to_assign.split(' => ')
    # parse and evaluate the condition
    parsed_condition = ast.parse(condition, mode='eval')
    if eval(compile(parsed_condition, filename="", mode="eval")):
        new_variable = eval(value_to_assign)
    else:
        new_variable = 'default_value'
    print(condition_and_value_to_assign)
    print(new_variable)
    print()
```

When run, this outputs:
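```
api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']
Happy Joy

api_hit['segment'] == 'something' => 'a fixed value'
a fixed value

api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']
default_value
```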
I'm not super excited about the use of `eval`, though.
I see what you mean here, but I feel like dynamically evaluating code from a YAML file is even more complex than just having one Python script that calls another Python script 😅 For most sites we might not even need any special handling: see for example the True Amateurs scraper, which can just use the general API results and so doesn't have a separate Python script 🙂
edit: whoops, originally linked to the wrong scraper here, not Trans Angels but True Amateurs
You'll have to enlighten me, as in: I can't see how any of the scraped models provide any marker feature.
Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand. It breaks the model of scrapers: instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results), it makes the scraper call the GraphQL API to mutate the scene as it's being scraped. It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers. I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs.
If you look at the Adult Time JSON, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.
Thanks, I wasn't aware that they provided these :) I'll make a note of it for when we expand the use of this to AdultTime and the other sites that can use this API 👍
Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they'd be integrated as something we can pass to Stash like any other scraped metadata, but if you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag, and adds the markers. This would happen after the user confirms the scrape and the scene is updated. Even better would be the ability to hook a post-scrape update.
Scrapers can't register hooks, but I see your point in that we could maintain a separate plugin for this 👍
Implemented studio name determination for:
As I also make use of the current Adultime (sic) scraper, I made a new AdultTime scraper. I'm about 2/3 of the way through checking the Sub Studios for Adult Time Originals to add any custom logic needed for the studio names, URLs, and extra preview site pages. I was in two minds about including the new AdultTime scraper in this PR.
Ah ok, I see what you mean... I've added a (just debug logging) function to AdultTime.py which shows, e.g.
Multiple search hits are now sorted by match ratio scoring. All that I think is left to implement is the extra matching based on file info like duration, resolution, etc.
File metadata (duration, file size) is now used in the match scoring.
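The scoring idea can be sketched roughly like this (a minimal illustration only; the hit/fragment field names, weights, and thresholds here are assumptions, not the PR's actual implementation):

```python
from difflib import SequenceMatcher

def match_score(hit: dict, fragment: dict) -> float:
    """Score an API hit against the scraped fragment: fuzzy title
    similarity, nudged upwards by a close file-duration match."""
    score = SequenceMatcher(
        None, hit.get("title", "").lower(), fragment.get("title", "").lower()
    ).ratio()
    # bonus when the file duration is within a few seconds of the hit's
    if hit.get("length") and fragment.get("duration"):
        if abs(hit["length"] - fragment["duration"]) <= 5:
            score += 0.5
    return score

# picking the best of multiple search hits would then be:
# best = max(hits, key=lambda h: match_score(h, fragment))
```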
Putting back to draft while I move some of the Adult Time studios from their own scraper (e.g. All Girl Massage, Fantasy Massage, etc.) to the new AdultTime scraper.
Just adding a comment here to clarify that this pull request has been set back to "ready for review" status. I think the scraper here works better than the existing one, is much more reusable (as demonstrated by the scrapers migrated to use this AlgoliaAPI.py, or their own customised variant of it, rather than the existing Algolia.py), and should be a lot easier to maintain. @Maista6969 (and anyone else) if you have any input/comments on this, I'd appreciate it.
Awesome work @nrg101!! The AdultTime scraper needs some loving though.
Non-working domains
These studios wrongfully return domain/en/video/
Missing studios from AdultTime.yml
EvilAngel
Thanks for the feedback on what doesn't work correctly. Those should be easy enough for me to tweak/add.
OK @SpecialKeta, how is that now? If still broken, can you give some example URLs?
Glad my changes fixed the previous issues. The new list has quite a few where the "site" doesn't even appear to be a DNS-resolving domain... I could simplify the URL generation logic to use the utility HEAD request on the site homepage to see if it is even an actual website, and then just keep the few tweaks for slug URLs, rather than having to override every site (in availableOnSite in the API) without its own website. I guess the Algolia API plays a bit fast and loose with the term "site": it could just mean an area within the network site, like a channel or series, or a future site that doesn't exist yet (like some of the AdultTime ones), or perhaps it used to be its own site and now they cba and just publish on the network site... Whatever the case, thanks for all the examples. I think I'll add the HEAD request check to address many of these, which could also mean I can simplify some of the scrapers I've already touched, where I had to add a bunch of extra logic to prevent extra URLs being generated for websites that don't exist.
@SpecialKeta I've added a HEAD 200 check for the website of an "availableOnSite" domain API result, so now URLs for non-working/non-existent sites should not be scraped. How does that look now?
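A minimal version of that check might look like this (a sketch only; the helper name and URL scheme are assumptions, and the real check in the PR may differ):

```python
import requests

def site_exists(domain: str) -> bool:
    """Return True if the site's homepage answers a HEAD request with 200."""
    try:
        res = requests.head(
            f"https://www.{domain}", timeout=10, allow_redirects=True
        )
        return res.status_code == 200
    except requests.RequestException:
        return False

# only generate URLs for availableOnSite entries that are live websites, e.g.
# urls = [make_url(site) for site in available_on_sites if site_exists(site)]
```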
Overview
I started this as an attempt to fix the performerByURL scraping for EvilAngel, but it turned into the long overdue effort to overhaul the Algolia script.
Scraper type(s)
Outstanding tasks
Examples to test
performerByURL
performerByName
Create Performer > Scrape with... > EvilAngel > Performer Name = Ariel
choose Ariel Demure from the results
performerByFragment
do the Create Performer search action above
go to an existing performer that has scenes at Evil Angel > Edit > Scrape with... > EvilAngel
sceneByURL
movieByURL
galleryByURL
Short description
Problem
Recently, many Algolia-based sites have closed the free access to pages like:
This means you can no longer browse videos, performers and movies on sites like evilangel.com, genderxfilms.com, and a whole load of other sites. This is especially annoying for performer scraping, as that is not implemented in the current Algolia.py.
Solution
There is actually a full Python client for the Algolia API, and all that's needed is fetching the appId and apiKey, and setting the host and referer headers. By referring to that client's docs, the current Algolia.py, and the Aylo API script, I've cobbled together a working:
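For context, the underlying request is simple enough to sketch with plain requests against Algolia's documented REST search endpoint (the app ID, API key, index name, and referer below are placeholders, not the real values the script extracts from the site):

```python
import requests

APP_ID = "APPID123"        # placeholder: scraped from the site's JS in practice
API_KEY = "searchonlykey"  # placeholder: search-only key fetched the same way

def algolia_search(index: str, query: str) -> dict:
    """POST a query to Algolia's REST search endpoint for one index."""
    url = f"https://{APP_ID.lower()}-dsn.algolia.net/1/indexes/{index}/query"
    headers = {
        "X-Algolia-Application-Id": APP_ID,
        "X-Algolia-API-Key": API_KEY,
        "Referer": "https://www.evilangel.com/",  # some deployments check this
    }
    res = requests.post(url, headers=headers, json={"query": query}, timeout=10)
    return res.json()

# e.g. hits = algolia_search("all_scenes", "some scene title")["hits"]
```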
The current Algolia.py is a whole load of jank taped together and is long overdue for an overhaul. Rather than trying to refactor it in-place, I've decided to make a new script called AlgoliaAPI.py, so that each site scraper can be migrated over individually.
The good parts of the existing Algolia.py should be included now.