Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Occasional deadlocks on DB #1675

Open
dukhaSlayer opened this issue Jan 14, 2025 · 4 comments
Open

Occasional deadlocks on DB #1675

dukhaSlayer opened this issue Jan 14, 2025 · 4 comments

Comments

@dukhaSlayer
Copy link
Contributor

dukhaSlayer commented Jan 14, 2025

Sooo...
After submitting this PR #1674 I looked at the code and it seems there are more issues of this nature.

The problem is that sometimes code that holds an open connection to the DB calls another component that tries to open another connection.

As an example:
AccountRecordProcessor is constructed with a Connection, yet in a call to StorageSyncModels.localToRemoteRecord it passes account.getConfigurationStore(), which will try to create a new Connection with keyValueStore.

This particular example might not be critical but there are more issues like that and under load they produce errors while working with storage now and then.

For us, it leads to complete session degradation within a week with the client being completely unable to decrypt incoming messages from known contacts.

I modifier Database.getConnection and found some more deadlocks (basically, SQLBUSY exceptions on store modifications):

    private static final Map<String, String> OpenConnections = new LinkedHashMap<>();

    public final Connection getConnection() throws SQLException {
        var id = UUID.randomUUID().toString();
        var stack = Arrays.stream(Thread.currentThread().getStackTrace()).map(Objects::toString).collect(Collectors.joining(", "));
        logger.trace("getConnection: {} on {}", id, stack);
        OpenConnections.put(id, stack);
        final Connection connection;
        try {
            connection = dataSource.getConnection();
        } catch (SQLException e) {
            logger.error("getConnection() failed for {}, possible deadlock. Open connections (the last one is failed):\n {}",
                id,
                OpenConnections.entrySet().stream().map(kv -> kv.getKey() + ": " + kv.getValue()).collect(Collectors.joining("; \n")),
                e);
            throw e;
        }
        return wrap(logger, id, connection);
    }

    public static Connection wrap(Logger logger, String id, Connection originalConnection) {
        return (Connection) Proxy.newProxyInstance(
            originalConnection.getClass().getClassLoader(),
            new Class[]{Connection.class},
            (proxy, method, args) -> {
                if ("close".equals(method.getName())) {
                    logger.trace("Connection.close() for {}", id);
                    OpenConnections.remove(id);
                }
                return method.invoke(originalConnection, args);
            }
        );
    }

There are many ways to address the issue, e.g.

  • "open" a connection for each operation never holding/passing it around
  • using interfaces/inheritance/wrappers to hide the fact that we're using an existing connection from other components (can be done in many ways). I have done some exercises with the code using this approach and so far I don't see error logs. Will do more testing today/tomorrow. Can submit a patch from what I did so far if you are interested.
  • the best way I can come up with is using an "opaque" db/store context which can provide a connection. The top-level context is constructed in jobs or on message arrival and we don't store it to account but rather pass it around to all methods that require access to DB (to avoid cluttering the code we can pass some other useful things in that context, like loggers, account itself etc). Then when we need an ad-hoc connection, there's a context.withConnection(c -> { ...}) method and if we want to do multiple operations we try (var ctx = ctx.connect()) {} which produces a context of the same type but with an already opened connection. This will allow to have functions, components, stores and helpers that can access DB but are unaware in which context they are called (with or without opened connection). So the code will be really simple and safe.
@AsamK
Copy link
Owner

AsamK commented Jan 14, 2025

The signal-cli database implementation started the first point, i.e. get a new connection for each operation. Then when implementing the storage sync it was extended to use existing connections so reading from storage could be done in a single transaction to prevent partial updates in case of failures.
The current model with some methods taking a connection and some creating a new one is a result of that. And that has indeed been a bit error prone ...

I'd be interested to see your changes.
Adding an additional context would probably be a bigger change, though maybe it could also be implemented with ThreadLocals or ScopedValues.

@dukhaSlayer
Copy link
Contributor Author

So, I spent some time working with code and I think I got rid of these errors. Also, from my logs it seems that they might have been a reason for some integrity issues which I experienced (not being able to exchange contact information between devices or even decrypt messages). Together with this change: #1690 it made situation way better for me.

So, what did I change with DB:

  1. Updated Database config in org.asamk.signal.manager.storage.Database#getHikariDataSource to have config.setMaximumPoolSize(1), so all connections are exclusive.
  2. Set connection timeout to 20 seconds. SQLite is pretty fast and that should be enough for a transaction to settle.
  3. Enabled WAL, but in a separate statement, adding pragma to setConnectionInitSql didn't help (you can see that by the presence of the .wal file on disk)
  4. Added the code I posted before in this issue ^^ to detect deadlocks. With this code, I found a number of places when a thread holding a connection already attempted to establish a new connection and thus locked indefinitely.
  5. Because some of the cases with deadlocks were not very obvious, I made some naive modifications to the code to "automate" connection presence. Didn't have time to think it through and implement it in a good way. Maybe will do it next weekend.

@dukhaSlayer
Copy link
Contributor Author

signalcli.patch.zip

Posting a patch file with a code to my previous comment to give a general feeling of what I did.
Sorry, it has some other unrelated changes that are not needed for those fixed :(
In the end, I wouldn't merge this : )
But having deadlock detection in place it's easy to find all the deadlocks if you have a session under load.

@GooseBravo101
Copy link

Thank you for deeper investigation, caught 11 deadlocks in like an hour
Left it running for some time
attaching a log for 38 of them, maybe it would help with the investigation (leaving out all messages information)

logFile.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants