Improve metadata bwc test for Logical Replication #354


Draft: jeeminso wants to merge 2 commits into master from jeeminso/lr

Conversation

jeeminso (Contributor) commented Jun 1, 2025

Summary of the changes / Why this is an improvement

Checklist

  • Link to issue this PR refers to (if applicable): Fixes #???

Comment on lines 161 to 173
# Set up tables for logical replications
if int(path.from_version.split('.')[0]) >= 5 and int(path.from_version.split('.')[1]) >= 10:
    c.execute("create table doc.x (a int) clustered into 1 shards with (number_of_replicas=0)")
    expected_active_shards += 1
    c.execute("create publication p for table doc.x")
    with connect(replica_cluster.node().http_url, error_trace=True) as replica_conn:
        rc = replica_conn.cursor()
        rc.execute("create table doc.rx (a int) clustered into 1 shards with (number_of_replicas=0)")
        rc.execute("create publication rp for table doc.rx")
        rc.execute(f"create subscription rs connection 'crate://localhost:{cluster.node().addresses.transport.port}?user=crate&sslmode=sniff' publication p")
        wait_for_active_shards(rc)
    c.execute(f"create subscription s connection 'crate://localhost:{replica_cluster.node().addresses.transport.port}?user=crate&sslmode=sniff' publication rp")
    wait_for_active_shards(c)
jeeminso (Contributor Author):
If I remove the calls to wait_for_active_shards and move on to the rolling upgrades immediately, I observe unexpected behaviours such as UnavailableShardsException, or the number of replicated rows does not add up correctly. But to my knowledge, users are recommended to wait for active shards before upgrading, so this is not an issue?
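
For readers of this thread: wait_for_active_shards is the crate-qa helper used in the snippet above. A minimal sketch of the kind of polling such a helper performs, assuming a plain DB-API cursor and the sys.shards table (the actual helper in crate-qa may differ in signature and details):

import time

def wait_for_active_shards(cursor, expected_active_shards=None, timeout=60):
    # Sketch only: poll sys.shards until shards are active. The real crate-qa
    # helper may behave differently.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if expected_active_shards is None:
            # No expectation given: wait until no shard is in a non-STARTED state.
            cursor.execute("select count(*) from sys.shards where state != 'STARTED'")
            if cursor.fetchone()[0] == 0:
                return
        else:
            cursor.execute("select count(*) from sys.shards where state = 'STARTED'")
            if cursor.fetchone()[0] >= expected_active_shards:
                return
        time.sleep(0.5)
    raise TimeoutError(f"shards did not become active within {timeout}s")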

Comment on lines 318 to 319
# account for replication delay; neither wait_for_active_shards nor REFRESH helps here
time.sleep(5)
jeeminso (Contributor Author):
Is there a better option for this?

seut (Member) commented Jun 4, 2025:
Yes, use assert_busy.

jeeminso (Contributor Author):
Thank you!
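
For reference, a minimal sketch of the assert_busy pattern suggested above: a helper that retries an assertion callable until it passes or a timeout expires, and how it could replace the time.sleep(5). Names and signature here are illustrative; the actual helper in crate-qa may differ.

import time

def assert_busy(assertion, timeout=60, interval=0.5):
    # Retry the assertion until it stops raising AssertionError or the
    # timeout expires (illustrative sketch of the pattern).
    deadline = time.time() + timeout
    while True:
        try:
            assertion()
            return
        except AssertionError:
            if time.time() >= deadline:
                raise
            time.sleep(interval)

# Instead of time.sleep(5), retry the replication check until it holds,
# e.g. (expected_rows is whatever the test has inserted so far):
def replicated(cursor, expected_rows):
    cursor.execute("select count(*) from doc.x")
    assert cursor.fetchone()[0] == expected_rows

# assert_busy(lambda: replicated(rc, expected_rows))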

jeeminso force-pushed the jeeminso/lr branch 3 times, most recently from a47df1d to e4b1b28 on June 4, 2025 at 21:30
jeeminso (Contributor Author) commented Jun 5, 2025:

This is unexpected behaviour from a logically replicated shard: select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x'; and select count(*) from doc.x; are out of sync for a noticeable amount of time, which gives me the suspicion that invoking the latter query forces the former to be updated:

cr> select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x';
+---------------+
| sum(num_docs) |
+---------------+
|             1 |
+---------------+
SELECT 1 row in set (0.003 sec)
cr> select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x';
+---------------+
| sum(num_docs) |
+---------------+
|             1 |
+---------------+
SELECT 1 row in set (0.003 sec)
cr> select count(*) from doc.x;
+----------+
| count(*) |
+----------+
|        2 |
+----------+
SELECT 1 row in set (0.684 sec)
cr> select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x';
+---------------+
| sum(num_docs) |
+---------------+
|             2 |
+---------------+
SELECT 1 row in set (0.003 sec)

jeeminso (Contributor Author) commented Jun 5, 2025:

It is intermittent behaviour (I was able to reproduce it, but rarely, on the latest master, replicating from a 1-node cluster to a 1-node cluster): select count(*) from doc.x; does seem to cause select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x'; to reflect the latest insert.
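
Assuming the assert_busy-style retry sketched earlier, one way to observe whether the two views converge without a fixed sleep would be to poll until sum(num_docs) matches count(*). Purely illustrative, and note the caveat discussed above: running count(*) may itself trigger the refresh.

def num_docs_in_sync(cursor):
    # The two queries from the session above; sys.shards appears to lag
    # until doc.x itself is queried.
    cursor.execute(
        "select sum(num_docs) from sys.shards "
        "where schema_name = 'doc' and table_name = 'x'")
    num_docs = cursor.fetchone()[0]
    cursor.execute("select count(*) from doc.x")
    row_count = cursor.fetchone()[0]
    assert num_docs == row_count, f"num_docs={num_docs}, count={row_count}"

# assert_busy(lambda: num_docs_in_sync(rc))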

c.execute("insert into doc.x values (1)")
rc.execute("insert into doc.rx values (1)")

rc.execute("select count(*) from doc.x")
jeeminso (Contributor Author):

crate.client.exceptions.ProgrammingError: RelationUnknown[Relation 'doc.x' unknown]
io.crate.exceptions.RelationUnknown: Relation 'doc.x' unknown
	at io.crate.exceptions.RelationUnknown.of(RelationUnknown.java:46)

Guessing this means that the DROP stmt succeeded; looking into it.

jeeminso (Contributor Author) commented Jun 5, 2025:

The first commit tests LR during a rolling upgrade 5.10 > jeeminso/temp and the second commit tests 5.10 > branch:master, where the first passes and the second fails, indicating that there is a regression caused by crate/crate#17960.

Hi @seut, could you take a look? BTW, this problem is intermittent, especially when trying to reproduce it manually.

seut (Member) commented Jun 6, 2025:

@jeeminso
Thanks for this info, I'll have a look into this asap.
