
Fix EvilAngel performerByURL -> Refactor Algolia scraping #2177

Open · wants to merge 75 commits into base: master
Conversation

nrg101
Contributor

@nrg101 nrg101 commented Jan 24, 2025

Overview

I started this as an attempt to fix the performerByURL scraping for EvilAngel, but it turned into the long overdue effort to overhaul the Algolia script.

Scraper type(s)

  • performerByName
  • performerByFragment
  • performerByURL
  • sceneByName
  • sceneByQueryFragment
  • sceneByFragment
  • sceneByURL
  • movieByURL
  • galleryByFragment
  • galleryByURL

Outstanding tasks

  • implement search match scoring/comparison
  • implement studio name determination logic
  • handle searching across multiple sites (is this ever needed? I'm going to say no at this point)
  • match by file info (e.g. duration, resolution, whatever)

Examples to test

performerByURL

performerByName

Create Performer > Scrape with... > EvilAngel > Performer Name = Ariel

  • select Ariel Demure from the results

performerByFragment

do the Create Performer search action above

  • see the scraped performer

go to an existing performer that has scenes at Evil Angel > Edit > Scrape with... > EvilAngel

  • select the performer from the results
  • see the additional/new/different scraped data

sceneByURL

movieByURL

galleryByURL

Short description

Problem

Recently, many Algolia-based sites have closed the free access to pages like:

  • /en/videos
  • /en/pornstars
  • /en/movies
  • /en/video/evilangel/TS-SOPHIA-MONTESINO-Spunky-Anal-Date/256714
  • /en/movie/Transgressive-25/126353
  • /en/pornstar/view/Brittney-Kade/92399

This means you can no longer browse videos, performers, and movies on sites like evilangel.com, genderxfilms.com, and a whole load of other sites. This is especially annoying for performer scraping, as that is not implemented in the current Algolia.py.

Solution

There is actually a full Python client for the Algolia API, and all that's needed is fetching the appId and apiKey and setting the host and referer headers (a rough sketch of this flow follows the list below). By referring to that client's docs, the current Algolia.py, and the Aylo API script, I've cobbled together a working:

  • performerByURL -> lookup performer by URL (the ID at the end)
  • performerByName -> searches for up to 20 performers matching a text string
  • performerByFragment -> looks up performer from one of the search results from performerByName
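
As a minimal sketch of that flow, using plain requests against the Algolia REST endpoint rather than the client library: the homepage URL, the credential regexes, and the all_actors index name below are assumptions for illustration, not necessarily what AlgoliaAPI.py actually does.

import re

import requests

HOMEPAGE = "https://www.evilangel.com"  # assumed page that embeds the credentials

def fetch_algolia_credentials() -> tuple[str, str]:
    # Assumption: the appId/apiKey pair is embedded in the homepage markup;
    # the exact pattern varies per site, so these regexes are only illustrative.
    html = requests.get(HOMEPAGE, timeout=10).text
    app_id = re.search(r'"applicationID"\s*:\s*"([^"]+)"', html).group(1)
    api_key = re.search(r'"apiKey"\s*:\s*"([^"]+)"', html).group(1)
    return app_id, api_key

def search_performers(query: str, hits_per_page: int = 20) -> list[dict]:
    app_id, api_key = fetch_algolia_credentials()
    # Plain REST call to the Algolia query endpoint; the Referer mirrors what the
    # site itself sends so that the scraped key is accepted.
    resp = requests.post(
        f"https://{app_id}-dsn.algolia.net/1/indexes/all_actors/query",
        json={"query": query, "hitsPerPage": hits_per_page},
        headers={
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": api_key,
            "Referer": HOMEPAGE,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["hits"]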

The current Algolia.py is a whole load of jank taped together and is long overdue for an overhaul. Rather than trying to refactor it in-place, I've decided to make a new script called AlgoliaAPI.py, so that each site scraper can be migrated over individually.

The good parts of the existing Algolia.py should be included now.

@nrg101 nrg101 marked this pull request as draft January 24, 2025 02:41
@nrg101
Contributor Author

nrg101 commented Jan 24, 2025

I've done a first pass of implementing all the scrapers for EvilAngel.yml with the new AlgoliaAPI.py.

There are some TODOs for handling multiple sites, and for doing some form of result match scoring (e.g. for galleryByFragment) when an operation finds multiple API results but can only return a single scraper result.

@nrg101
Contributor Author

nrg101 commented Jan 24, 2025

@Maista6969 I think a long time ago, there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago...

Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.:

  • galleryByFragment multiple results match-scoring to return best match as single result... I think this is all the match ratio jank in the existing Algolia.py
  • anything that scrapes a studio could do with a set of logic to determine the studio name from the studio_name, network, serie, channel, sitename, etc. I think this would be really nice if it could be one of the "extra" array items in the respective YAML, so I will see if that's feasible in a nice way that everyone can get along with
  • handling multiple sites... I'm not sure how important this is, but there may be scenarios where the user would like to search more than one site

I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?

@ltgorman
Contributor

> @Maista6969 I think a long time ago, there was a discussion about refactoring the existing Algolia.py, and you were in that discussion? Sorry if I'm mistaken, it was quite some time ago...
>
> Anyway, what I have here in this PR is working, albeit with some functionality yet to port over, e.g.:
>
>   • galleryByFragment multiple results match-scoring to return best match as single result... I think this is all the match ratio jank in the existing Algolia.py
>   • anything that scrapes a studio could do with a set of logic to determine the studio name from the studio_name, network, serie, channel, sitename, etc. I think this would be really nice if it could be one of the "extra" array items in the respective YAML, so I will see if that's feasible in a nice way that everyone can get along with
>   • handling multiple sites... I'm not sure how important this is, but there may be scenarios where the user would like to search more than one site
>
> I may have overlooked some stuff, but I was wondering if you (or anyone else) has any input, suggestions, requests, etc. at this point?

Implementing markers would be nice.

Collaborator

@Maista6969 Maista6969 left a comment

I've been wanting to rewrite Algolia for a long time, thank you for contributing this! I certainly agree that the old Algolia has become pretty crufty and I think this is an excellent start on the Road to Refactor 😁

The next step will be creating an EvilAngel.py that uses this API module so we can have an extra layer of indirection where we can handle the special cases for this site like studio remappings that are currently such a mess in the old Algolia.py
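
For what it's worth, such a site wrapper could stay very thin. The sketch below assumes a hypothetical scene_from_url helper exposed by AlgoliaAPI.py (the real interface may well differ) and only shows where EvilAngel-specific fix-ups like the studio remapping would live.

import json
import sys

# Hypothetical import: the actual name and signature exposed by AlgoliaAPI.py may differ.
from AlgoliaAPI import scene_from_url

def postprocess(scene: dict, api_hit: dict) -> dict:
    # Site-specific fix-ups live in this wrapper rather than in the shared module,
    # e.g. the studio remappings discussed above (this particular rule is illustrative only).
    if api_hit.get("serie_name") == "TransPlaytime":
        scene["studio"] = {"name": "TransPlaytime"}
    return scene

if __name__ == "__main__":
    # Stash's scraper protocol: read a JSON fragment from stdin, print JSON to stdout.
    fragment = json.loads(sys.stdin.read())
    scene = scene_from_url(fragment["url"], postprocess=postprocess)
    print(json.dumps(scene))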

@nrg101
Contributor Author

nrg101 commented Jan 25, 2025

> The next step will be creating an EvilAngel.py that uses this API module so we can have an extra layer of indirection where we can handle the special cases for this site like studio remappings that are currently such a mess in the old Algolia.py

I did wonder what else would be needed, apart from the studio remapping stuff...

I thought that subclassing or importing functions from a "base" script (similar to the Aylo implementation) might be more complex than it's worth (to give flexibility that ultimately isn't needed).

With that in mind, I had a think about how the scraper configuration YAML could be used for something like the studio name mapping, and came up with a possible solution like this:

import ast

# the API hit dictionary
api_hit = {
    'studio_name': 'Enid Blyton',
    'serie_name': 'Groovy Gang',
    'channel_name': 'Happy Joy',
    'sitename': 'thisthing',
    'segment': 'something',
}

# these could come in via the `args["extra"]` list of strings in the scraper YAML
conditions_and_values_to_assign = [
    "api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']",
    "api_hit['segment'] == 'something' => 'a fixed value'",
    "api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']",
]

for condition_and_value_to_assign in conditions_and_values_to_assign:
    condition, value_to_assign = condition_and_value_to_assign.split(' => ')
    # Parsing and evaluating the condition
    parsed_condition = ast.parse(condition, mode='eval')
    if eval(compile(parsed_condition, filename="", mode="eval")):
        new_variable = eval(value_to_assign)
    else:
        new_variable = 'default_value'

    print(condition_and_value_to_assign)
    print(new_variable)
    print()

When run, this outputs:

api_hit['studio_name'] == 'Enid Blyton' => api_hit['channel_name']
Happy Joy

api_hit['segment'] == 'something' => 'a fixed value'
a fixed value

api_hit['studio_name'] == 'Not A Match' => api_hit['serie_name']
default_value

I'm not super excited about the use of eval, but it could be a solution for the studio mapping logic.

@Maista6969
Collaborator

Maista6969 commented Jan 25, 2025

I see what you mean here, but I feel like dynamically evaluating code from a YAML file is even more complex than just having one Python script that calls another Python script 😅

For most sites we might not even need any special handling, see for example the True Amateurs scraper which can just use the general API results and so doesn't have a separate Python script 🙂

edit: whoops originally linked to the wrong scraper here, not Trans Angels but True Amateurs

@nrg101
Contributor Author

nrg101 commented Jan 25, 2025

> Implementing markers would be nice.

You'll have to enlighten me, as in:

  • what is a marker?
  • what in the Algolia API provides data for markers?
  • how do markers get saved/persisted?

I can't see how any of the scraped models provide any marker feature.

@Maista6969
Collaborator

Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand: it breaks the model of scrapers because instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results) it makes the scraper call the GraphQL API to mutate the scene as it's being scraped

It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers

I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs

@ltgorman
Contributor

> Stash does not currently support scraping markers, but several scrapers have hacked it in because of user demand: it breaks the model of scrapers because instead of just returning results to Stash (where users can decide whether or not they'd like to keep the results) it makes the scraper call the GraphQL API to mutate the scene as it's being scraped
>
> It's currently a feature in the Vixen Network scraper as well as the Aylo API, but I'd much prefer to lobby for native support before hacking it into any more scrapers
>
> I think it's a moot point here though, as far as I can tell these sites don't have marker data in their APIs

If you look at the Adult Time json, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.

@Maista6969
Collaborator

> If you look at the Adult Time json, markers are there under "action_tags". It might be that not all the studios have them; the same thing happens with Aylo studios. I get your desire to make the support more native, I was just throwing the suggestion out there.

Thanks, I wasn't aware that they provided these :) I'll make a note of it for when we expand the use of this to AdultTime and the other sites that can use this API 👍

@stg-annon
Contributor

Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they would be integrated as something we can pass to Stash like any other scraped metadata. If you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag, and adds the markers; this would happen after the user confirms the scrape and the scene is updated. Even better would be the ability to hook a post-scrape update.

@Maista6969
Collaborator

> Yeah, to grab markers with a scrape you really want to be sure you matched correctly when you pull them. Ideally they would be integrated as something we can pass to Stash like any other scraped metadata. If you want to use a scraper with a confirmation dialog, the next best option would probably be an on-update hook that looks for a custom flag added by the scraper, removes the flag, and adds the markers; this would happen after the user confirms the scrape and the scene is updated. Even better would be the ability to hook a post-scrape update.

Scrapers can't register hooks, but I see your point in that we could maintain a separate plugin for this 👍

@nrg101
Contributor Author

nrg101 commented Jan 27, 2025

Implemented studio name determination (see the sketch after this list) for:

  • studios listed in EvilAngel.yml
  • TransPlaytime, as those scenes have evilangel.com URLs
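
To be concrete about what "studio name determination" means here, a rough sketch follows; the override table and the field precedence are made up for the example and may differ from what the scraper actually does.

# Illustrative override table; the real mappings in EvilAngel.py may differ.
STUDIO_OVERRIDES = {
    "transplaytime": "TransPlaytime",  # scenes carry evilangel.com URLs, as noted above
}

def determine_studio(api_hit: dict) -> str | None:
    sitename = (api_hit.get("sitename") or "").lower()
    if sitename in STUDIO_OVERRIDES:
        return STUDIO_OVERRIDES[sitename]
    # Otherwise fall back through the fields the API exposes, most specific first.
    for key in ("serie_name", "channel_name", "studio_name"):
        if api_hit.get(key):
            return api_hit[key]
    return None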

@nrg101
Contributor Author

nrg101 commented Jan 28, 2025

As I also make use of the current Adultime (sic) scraper, I made a new AdultTime folder with a new scraper that extends the new AlgoliaAPI, in a similar way to the new EvilAngel.py.

I'm about 2/3 of the way through checking the Sub Studios for Adult Time Originals to add any custom logic needed for the studio names, URLs, and extra preview site pages.

I was in two minds about including the new AdultTime scraper in this branch/PR, but I have been refining the code in AlgoliaAPI.py while checking through all the AdultTime sub studios, so it seems directly relevant. I would imagine that once all the EvilAngel and AdultTime studios are covered and working correctly, this will be a solid reference for any other networks that use the Algolia API.

@nrg101
Contributor Author

nrg101 commented Jan 29, 2025

> If you look at the Adult Time json, markers are there under "action_tags".

Ah ok, I see what you mean... I've added a (just debug logging) function to AdultTime.py which shows, e.g.:

[Scrape / AdultTime] action_tags: [{'name': 'Rimming', 'timecode': 1814}, {'name': 'Deepthroat', 'timecode': 1138}, {'name': 'Anal Toys', 'timecode': 710}, {'name': 'Anal', 'timecode': 2174}, {'name': 'Cum in Mouth', 'timecode': 2960}, {'name': 'Anal', 'timecode': 2072}, {'name': 'Cum in Mouth', 'timecode': 2898}, {'name': 'Hair Pulling', 'timecode': 648}, {'name': 'Face Fucking', 'timecode': 1017}, {'name': 'Rimming', 'timecode': 1638}, {'name': 'Anal Pile Driving', 'timecode': 2600}, {'name': 'Anal', 'timecode': 2276}, {'name': 'Deepthroat', 'timecode': 1462}, {'name': 'Anal Pile Driving', 'timecode': 2796}, {'name': 'Choking', 'timecode': 2307}, {'name': 'Anal Pile Driving', 'timecode': 2696}, {'name': 'Deepthroat', 'timecode': 1357}, {'name': 'Anal Reverse Cowgirl', 'timecode': 2351}, {'name': 'Deepthroat', 'timecode': 1600}, {'name': 'Dick Slap', 'timecode': 604}, {'name': 'Deepthroat', 'timecode': 1530}, {'name': 'Face Fucking', 'timecode': 1254}, {'name': 'Spitting', 'timecode': 1717}, {'name': 'Cumshot', 'timecode': 2994}, {'name': 'Handjob', 'timecode': 869}, {'name': 'Ball Sucking', 'timecode': 1867}, {'name': 'Anal Pile Driving', 'timecode': 2512}, {'name': 'Deepthroat', 'timecode': 1508}, {'name': 'Face Fucking', 'timecode': 1403}, {'name': 'Rimming', 'timecode': 1905}, {'name': 'Dick Slap', 'timecode': 1436}, {'name': 'Handjob', 'timecode': 1989}]

The process_action_tags function could be made to do something other than just logging, like adding markers via GraphQL.
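
For reference, a rough sketch of turning those action_tags into marker-shaped data; the output dict shape is an assumption, and actually persisting markers would still need a call to Stash's GraphQL API, as discussed above.

def action_tags_to_markers(action_tags: list[dict]) -> list[dict]:
    # Sort by timecode and keep only well-formed entries; the dict shape below is an
    # assumed marker-like structure, not something Stash's scraper protocol accepts today.
    markers = []
    for tag in sorted(action_tags, key=lambda t: t.get("timecode", 0)):
        if "name" not in tag or "timecode" not in tag:
            continue
        markers.append({
            "title": tag["name"],
            "seconds": tag["timecode"],
            "primary_tag": tag["name"],
        })
    return markers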

@nrg101
Contributor Author

nrg101 commented Jan 29, 2025

Multiple search hits are now sorted by a match ratio scoring.

All that I think is left to implement is the extra matching based on file info like duration, resolution, etc.
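
Roughly speaking, the scoring can be as simple as a string similarity ratio; this is a sketch of the idea, not necessarily the exact scoring used in the PR.

from difflib import SequenceMatcher

def title_score(query_title: str, hit_title: str) -> float:
    # Case-insensitive similarity in [0, 1]; 1.0 is an exact match.
    return SequenceMatcher(None, query_title.lower(), hit_title.lower()).ratio()

def best_hit(query_title: str, hits: list[dict]) -> dict:
    # Pick the highest-scoring hit as the single scraper result.
    return max(hits, key=lambda hit: title_score(query_title, hit.get("title", "")))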

@nrg101
Copy link
Contributor Author

nrg101 commented Jan 30, 2025

File metadata (duration, file size) is now used in the match scoring.
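
Along the same lines, a sketch of how a duration check could feed into that score; the tolerance and the all-or-nothing weighting are made up for the example.

def duration_score(file_duration: float | None, hit_duration: float | None,
                   tolerance: float = 5.0) -> float:
    # 1.0 when the durations agree within `tolerance` seconds, 0.0 otherwise;
    # a real implementation might decay gradually or also weigh in file size.
    if not file_duration or not hit_duration:
        return 0.0
    return 1.0 if abs(file_duration - hit_duration) <= tolerance else 0.0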

@nrg101 nrg101 marked this pull request as ready for review January 30, 2025 01:48
@nrg101 nrg101 requested a review from Maista6969 January 30, 2025 01:48
@nrg101 nrg101 marked this pull request as draft January 30, 2025 16:16
@nrg101
Contributor Author

nrg101 commented Jan 30, 2025

Putting this back to draft while I move some of the Adult Time studios from their own scraper (e.g. All Girl Massage, Fantasy Massage, etc.) to the new AdultTime scraper.

@nrg101 nrg101 marked this pull request as ready for review January 30, 2025 18:17
@nrg101
Contributor Author

nrg101 commented Mar 12, 2025

Just adding a comment here to clarify that this pull request has been set back to "ready for review" status.

I think the scraper here works better than the existing one, is much more reusable (as demonstrated by the scrapers migrated to use this AlgoliaAPI.py, or their own customised variant of it, rather than the existing Algolia.py), and should be a lot easier to maintain.

@Maista6969 (and anyone else) if you have any input/comments on this, I'd appreciate it.

@SpecialKeta
Contributor

Awesome work @nrg101 !!

The AdultTime scraper needs some loving, though.

Non-working domains

These studios wrongfully return domain/en/video/

Missing studios from AdultTime.yml

  • devilsgangbangs.com/en/video
  • devilsfilmparodies.com/en/video
  • thebrats.com/en/video
  • wheretheboysarent.com/en/video
  • whiteghetto.com/en/video

EvilAngel
Scraping a Lexington Steele scene returns lexingtonsteele.com, but that's a non-working domain.

@nrg101
Contributor Author

nrg101 commented Mar 13, 2025

Thanks for the feedback on what doesn't work correctly. Those should be easy enough for me to tweak/add.

@nrg101
Contributor Author

nrg101 commented Mar 17, 2025

OK @SpecialKeta, how is that now? If still broken, can you give some example URLs?

@SpecialKeta
Contributor

SpecialKeta commented Mar 18, 2025

Changes look good, great work!

Below are some more URLs I checked. I haven't checked 21Bonus substudios yet.

AdultTime substudios

Adult Time Films
https://members.adulttime.com/en/video/adulttime/An-Americana-Orgy/207204 is behind a paywall; the scraper returns https://members.adulttime.com/en/video/adulttime/An-Americana-Orgy/207204

Adult Time x Vixen
https://www.adulttime.com/en/video/Vixen/Angela-White-Fucks-A-HUGE-Cock/171884 >: non-working https://www.vixen.com/en/video/vixen/Angela-White-Fucks-A-HUGE-Cock/171884

UpCloseX
https://members.adulttime.com/en/video/upclosex/Destiny-Lovee-Up-Close-And-Personal/181393 >: non-working https://www.upclosex.com/en/video/upclosex/Destiny-Lovee-Up-Close-And-Personal/181393

Vivid
https://tour1.vivid.com/en/video/vivid/Naked-Reunion---Part-3/137113 >: https://www.vivid.com/en/video/vivid/Naked-Reunion---Part-3/137113, but should be tour1 instead of www

21st Sextury substudios

The scraper returns 2 URLs: the working 21sextury.com one, but also a 2nd URL. Here's a list of the non-working 2nd URLs the scraper returns.

Anal Queen Alysa
https://www.21sextury.com/en/video/analqueenalysa/In-Need-of-a-Third/96567 >: https://www.analqueenalysa.com/en/video/analqueenalysa/In-Need-of-a-Third/96567

BlueAngelLive
https://www.21sextury.com/en/video/blueangellive/NudeFightClub-backstage-with-Blue-Angel-and-Ruth-Medina/93096 >: https://www.blueangellive.com/en/video/blueangellive/NudeFightClub-backstage-with-Blue-Angel-and-Ruth-Medina/93096

ButtPlays
https://www.21sextury.com/en/video/buttplays/Dew-Me-Baby/124065 >: https://www.buttplays.com/en/video/buttplays/Dew-Me-Baby/124065

CheatingWhoreWives
https://www.21sextury.com/en/video/cheatingwhorewives/Being-bored/83248 >: https://www.cheatingwhorewives.com/en/video/cheatingwhorewives/Being-bored/83248

CutiesGalore
https://www.21sextury.com/en/video/cutiesgalore/CutiesGalore-presents-Sasha/95928 >: https://www.cutiesgalore.com/en/video/cutiesgalore/CutiesGalore-presents-Sasha/95928

Club Sandy
https://www.21sextury.com/en/video/clubsandy/Sandy-Interview/89338 >: https://www.clubsandy.com/en/video/clubsandy/Sandy-Interview/89338

DeepThroatFrenzy
https://www.21sextury.com/en/video/deepthroatfrenzy/Perfect-Image/144293 >: https://www.deepthroatfrenzy.com/en/video/deepthroatfrenzy/Perfect-Image/144293

Gapeland
https://www.21sextury.com/en/video/gapeland/Having-A-Teen-Girlfriend/144354 >: https://www.gapeland.com/en/video/gapeland/Having-A-Teen-Girlfriend/144354

Hot Milf Club
https://www.21sextury.com/en/video/hotmilfclub/Hot-MILF-Vivian/95534 >: https://www.hotmilfclub.com/en/video/hotmilfclub/Hot-MILF-Vivian/95534

LetsPlayLez
https://www.21sextury.com/en/video/letsplaylez/Wild-plays-with-Maryel-and-Kendra-Star/97475 >: https://www.letsplaylez.com/en/video/letsplaylez/Wild-plays-with-Maryel-and-Kendra-Star/97475

OnlySwallows
https://www.21sextury.com/en/video/onlyswallows/Come-in-Kami/84238 >: https://www.onlyswallows.com/en/video/onlyswallows/Come-in-Kami/84238

Sex With Kathia Nobili
https://www.21sextury.com/en/video/sexwithkathianobili/Brothel-Tour/92914 >: https://www.sexwithkathianobili.com/en/video/sexwithkathianobili/Brothel-Tour/92914

Sweet Sophie Moone
https://www.21sextury.com/en/video/sweetsophiemoone/Behind-The-Camera/87520 >: https://www.sweetsophiemoone.com/en/video/sweetsophiemoone/Behind-The-Camera/87520

Pix and Video
https://www.21sextury.com/en/video/pixandvideo/Backstage-with-Regina-Ice-and-Regina--Moon/90243 >: https://www.pixandvideo.com/en/video/pixandvideo/Backstage-with-Regina-Ice-and-Regina--Moon/90243

21eroticanal
https://21naturals.com/en/video/21naturals/Sit-On-This/218114 >: https://www.21eroticanal.com/en/video/21eroticanal/Sit-On-This/218114

21footart
https://www.21naturals.com/en/video/21footart/Craving-Attention/169916 >: https://www.21footart.com/en/video/21footart/Craving-Attention/169916

Evil Angel
lewood.com has changed to Adult Empire Cash, so lewood.com/en/video is non-working.

@nrg101
Contributor Author

nrg101 commented Mar 18, 2025

Glad my changes fixed the previous issues.

The new list has quite a few entries where the "site" doesn't even appear to be a DNS-resolving domain... I could simplify the URL generation logic to use a utility HEAD request to the site homepage to check whether it is even an actual website, and then just keep the few tweaks for slug URLs, rather than having to override every site (in availableOnSite in the API) that doesn't have its own website.

I guess the Algolia API plays a bit fast and loose with the term "site": it could just mean an area of the network site, like a channel or series, or a future site that doesn't exist yet (like some of the AdultTime ones), or perhaps something that used to be its own site and now they just publish on the network site...

Whatever the case, thanks for all the examples. I think I'll add the HEAD request check to address many of these, which could also let me simplify some of the scrapers I've already touched, where I had to add a bunch of extra logic to prevent extra URLs being generated for websites that don't exist.
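
A sketch of that HEAD check, cached per domain so repeated scrapes don't hammer the sites; the exact timeout and redirect handling here are guesses, not necessarily what the scraper ends up doing.

from functools import lru_cache

import requests

@lru_cache(maxsize=None)
def site_is_live(domain: str) -> bool:
    # One cheap HEAD request per domain, cached for the lifetime of the scrape run.
    try:
        resp = requests.head(f"https://{domain}/", timeout=5, allow_redirects=True)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Usage idea: only generate a URL for an availableOnSite entry whose homepage responds, e.g.
# urls = [f"https://{d}{path}" for d in hit.get("availableOnSite", []) if site_is_live(d)]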

@nrg101
Contributor Author

nrg101 commented Mar 20, 2025

@SpecialKeta I've added a HEAD 200 check for the website of an "availableOnSite" domain API result, so URLs for non-working/non-existent sites should no longer be scraped.

How does that look now?
