Introduce CachedSupplier for BasePersistence objects #1765

Open · wants to merge 3 commits into main
Conversation

adnanhemani (Collaborator)

I came across an interesting bug yesterday that we need to fix to ensure that tasks can use the BasePersistence object, as they run outside of user call contexts.

What I was trying to do:

  1. Create and run a Task which dumps some information to the persistence. In order to do this, I was using the following line of code to get a BasePersistence object: metaStoreManagerFactory.getOrCreateSessionSupplier(CallContext.getCurrentContext().getRealmContext()).get();
  2. Get the following error message when executing the last .get() call:
jakarta.enterprise.context.ContextNotActiveException: RequestScoped context was not active when trying to obtain a bean instance for a client proxy...

When digging deeper into why this was happening, I realized that due to the Supplier's lazy loading at https://github.com/apache/polaris/blob/main/extension/persistence/relational-jdbc/src/main/java/org/apache/polaris/extension/persistence/relational/jdbc/JdbcMetaStoreManagerFactory.java#L100-L105, the .get() call was actually using a RequestScoped realmContext bean provided by the previously-run TokenBroker initialization (which is a RequestScoped object here: https://github.com/apache/polaris/blob/main/quarkus/service/src/main/java/org/apache/polaris/service/quarkus/config/QuarkusProducers.java#L290-L299). Given this is a relatively new addition, that may be why we haven't seen this bug before.

As Tasks run asynchronously, likely after the original request has already completed, this error actually makes sense - we should not be able to use a request-scoped bean inside a Task execution. But on closer inspection, we do not actually need realmContext for anything other than resolving the realmIdentifier once, during the BasePersistence object's initialization. As a result, we can cache the BasePersistence object using a supplier that memoizes the original result instead of constantly creating new objects. This also solves our issue, as the original request-scoped RealmContext bean will not be touched again during the Task's call to get a BasePersistence object.
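A memoizing supplier along these lines could look like the sketch below (hypothetical; the actual CachedSupplier in this PR may differ in details):

```java
import java.util.function.Supplier;

// Sketch of a memoizing supplier. The delegate runs at most once, on the
// first get(); every later call returns the cached result, so anything the
// delegate captured (e.g. a RequestScoped proxy) is never touched again.
final class CachedSupplier<T> implements Supplier<T> {
  private final Supplier<T> delegate;
  private volatile T cached;

  CachedSupplier(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public T get() {
    T result = cached;
    if (result == null) {
      // Double-checked locking: only one thread runs the delegate.
      synchronized (this) {
        result = cached;
        if (result == null) {
          result = delegate.get();
          cached = result;
        }
      }
    }
    return result;
  }
}
```

Wrapping the lambda stored in sessionSupplierMap in such a supplier means the RequestScoped RealmContext is dereferenced at most once, while the request is still active.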

I've added a test case that shows the difference between the out-of-the-box supplier and my preferred way to solve this problem, a CachedSupplier. If there is significant concern about caching the BasePersistence object, we can instead materialize the RealmContext before building the supplier, so that at a minimum the RequestScoped RealmContext object is not used again - but I'm not sure there's an easy way to test that, given that the MetastoreFactories are Quarkus ApplicationScoped objects.

Please note, this is an issue in both EclipseLink and JDBC, as they have almost identical code paths here.

Many thanks to @singhpk234 for being my debugging rubber ducky :)

adnanhemani (Collaborator, Author) commented May 31, 2025

cc @dimas-b (as you are looking at the similar issue at #1758), @eric-maynard , @collado-mike

edit: sorry, wrong PR number

adnanhemani (Collaborator, Author)

cc @adutra as well, as you are also looking at #1758.

adutra (Contributor) commented Jun 2, 2025

@adnanhemani thanks for bringing my attention to this PR.

I realized that due to the Supplier's lazy-loading [...] the .get() was actually using a RequestScoped realmContext bean given by the previously-ran TokenBroker initialization

Hmm I looked at your code snippets but I don't see the connection between the TokenBroker bean production and the lazy loading of JdbcBasePersistenceImpl. But assuming that this is happening inside a task executor thread, and the problem is RealmContext, why don't you resolve the realmId eagerly? E.g.:

  private void initializeForRealm(
      RealmContext realmContext, RootCredentialsSet rootCredentialsSet, boolean isBootstrap) {
    String realmId = realmContext.getRealmIdentifier(); // resolve realm ID eagerly
    DatasourceOperations databaseOperations = getDatasourceOperations(isBootstrap);
    sessionSupplierMap.put(
        realmId,
        () ->
            new JdbcBasePersistenceImpl(
                databaseOperations,
                secretsGenerator(() -> realmId, rootCredentialsSet),
                storageIntegrationProvider,
                realmId));

    PolarisMetaStoreManager metaStoreManager = createNewMetaStoreManager();
    metaStoreManagerMap.put(realmId, metaStoreManager);
  }

adnanhemani (Collaborator, Author)

@adutra thanks for taking a look :)

Hmm I looked at your code snippets but I don't see the connection between the TokenBroker bean production and the lazy loading of JdbcBasePersistenceImpl

The connection is that the TokenBroker bean is RequestScoped, and its initialization creates a BasePersistence Supplier using the realmContext from that RequestScoped bean initialization. That Supplier is then stored in the sessionSupplierMap; when lazy loading later invokes it, it tries to dereference the bean's (now-expired) realmContext. Does that make it clearer? If not, let me know which part is still unclear!

But assuming that this is happening inside a task executor thread, and the problem is RealmContext, why don't you resolve the realmId eagerly?

Yes, this was my original idea - but it was hard for me to construct a test case for that type of fix. Maybe this is something you have more experience with, but I just wasn't able to use a request-scoped realmContext bean in a test at all. Additionally, I'm not sure we gain anything from continuously re-creating JdbcBasePersistenceImpl objects - is there really a good reason to lazy-load this? If not, why not cache the object as-is?

As a result, I'm promoting the CachedSupplier as our preferred way to solve this issue instead. But I'm not heavily tied to this approach if we can find a good way to test the fix you suggested.

adutra (Contributor) commented Jun 2, 2025

The connection is that the TokenBroker bean is RequestScoped and it does create a BasePersistence Supplier

I still don't see any TokenBroker creating any BasePersistence anywhere in the code 🤔

@adnanhemani as it stands, this PR is imo not mergeable: it has no clear error description, no stack trace we can investigate, no reproducer, and no real test case (CachedSupplierTest is just a unit test; there is no test that shows evidence of a broken behavior that would be "fixed" by the proposed changes).

adnanhemani (Collaborator, Author) commented Jun 5, 2025

@adutra - I've reproduced the issue on a branch in my fork: https://github.com/adnanhemani/polaris/tree/ahemani/show_failure_1765

You can read the full diff there; I made a really simple case that creates a task when you create a catalog. The task only tries to get the BasePersistence object - which is where the call blows up due to the poisoned cache. Feel free to attach a debugger and you'll see that it is caused by the lazy loading of the JdbcBasePersistenceImpl class, and that the cache poisoning happened during the creation of the TokenBroker (RequestScoped) bean.

Steps to reproduce the error using the code linked above:

  1. [This can only be reproduced using JDBC or EclipseLink.] Create a Persistence instance and set application.properties to the right set of configurations.
  2. Run: ./polaris --client-id <CLIENT_ID> --client-secret <CLIENT_SECRET> catalogs create polaris1 --storage-type FILE --default-base-location "/var/tmp/polaris1/" (you must try this
  3. Wait for the Task to execute. It will fail and retry until it runs out of retries altogether, then log that the task could not be completed successfully. The stack trace is also visible there.

You can then apply this PR on top of that code and retry these steps and see that you will no longer see this issue.

More on how the TokenBroker creates the poisoned cache:

  1. tokenBrokerFactory.apply(realmContext): https://github.com/apache/polaris/blob/main/quarkus/service/src/main/java/org/apache/polaris/service/quarkus/config/QuarkusProducers.java#L289. Note this is a RequestScoped bean - and so is realmContext.
  2. createTokenBroker(realmContext): https://github.com/apache/polaris/blob/main/service/common/src/main/java/org/apache/polaris/service/auth/JWTRSAKeyPairFactory.java#L53
  3. metaStoreManagerFactory.getOrCreateMetaStoreManager(realmContext): https://github.com/apache/polaris/blob/main/service/common/src/main/java/org/apache/polaris/service/auth/JWTRSAKeyPairFactory.java#L65-L66
  4. initializeForRealm(realmContext, null, false);: https://github.com/apache/polaris/blob/main/persistence/relational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/JdbcMetaStoreManagerFactory.java#L177

And that call is where the sessionSupplierMap stores the poisoned lambda that creates JdbcBasePersistenceImpl. At no point in this call trace was realmContext replaced with a materialized version of the realmIdentifier - which is why a RequestScoped bean made its way into the sessionSupplierMap.
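To make the capture problem in step 4 concrete, here is an isolated sketch (names are hypothetical; RealmContext here stands in for the CDI client proxy): a lambda that captures the proxy dereferences it again on every call, while resolving the realm id eagerly captures only a plain String.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the CDI client proxy.
interface RealmContext {
  String getRealmIdentifier();
}

final class CaptureDemo {
  // Poisoned variant: the lambda captures the proxy itself, so every
  // get() dereferences it. That works while the request is active but
  // fails in a task thread, where the RequestScoped context is gone.
  static Supplier<String> lazyCapture(RealmContext proxy) {
    return () -> "persistence-for-" + proxy.getRealmIdentifier();
  }

  // Eager variant (as suggested above): the realm id is resolved while
  // the request is still active; the lambda captures only the String.
  static Supplier<String> eagerCapture(RealmContext proxy) {
    String realmId = proxy.getRealmIdentifier();
    return () -> "persistence-for-" + realmId;
  }
}
```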

Again, your suggestion above to change this behavior by materializing the realmContext (perhaps from the tokenBroker itself) would solve this issue. But I have no idea how to write a test that ensures something like this cannot happen again. If you have one, I'd be glad to switch to that approach.
