Improve metadata bwc test for Logical Replication #354


Draft: jeeminso wants to merge 2 commits into master from jeeminso/lr

Conversation

jeeminso (Contributor) commented Jun 1, 2025

Summary of the changes / Why this is an improvement

Checklist

  • Link to issue this PR refers to (if applicable): Fixes #???

Comment on lines 161 to 173
# Set up tables for logical replications
if int(path.from_version.split('.')[0]) >= 5 and int(path.from_version.split('.')[1]) >= 10:
    c.execute("create table doc.x (a int) clustered into 1 shards with (number_of_replicas=0)")
    expected_active_shards += 1
    c.execute("create publication p for table doc.x")
    with connect(replica_cluster.node().http_url, error_trace=True) as replica_conn:
        rc = replica_conn.cursor()
        rc.execute("create table doc.rx (a int) clustered into 1 shards with (number_of_replicas=0)")
        rc.execute("create publication rp for table doc.rx")
        rc.execute(f"create subscription rs connection 'crate://localhost:{cluster.node().addresses.transport.port}?user=crate&sslmode=sniff' publication p")
        wait_for_active_shards(rc)
    c.execute(f"create subscription s connection 'crate://localhost:{replica_cluster.node().addresses.transport.port}?user=crate&sslmode=sniff' publication rp")
    wait_for_active_shards(c)
jeeminso (Contributor Author):
If I remove the calls to wait_for_active_shards and move on to the rolling upgrades immediately, I observe unexpected behaviours such as UnavailableShardsException, or the number of replicated rows does not add up correctly. But to my knowledge, users are recommended to wait for active shards before upgrading, so this is not an issue?
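
For readers of this thread: wait_for_active_shards is the crate-qa helper used in the snippet above. A minimal sketch of the kind of polling such a helper performs, assuming a plain DB-API cursor and the sys.shards table (the actual helper in crate-qa may differ in signature and details):

import time

def wait_for_active_shards(cursor, expected_active_shards=None, timeout=60):
    # Sketch only: poll sys.shards until shards are active. The real crate-qa
    # helper may behave differently.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if expected_active_shards is None:
            # No expectation given: wait until no shard is in a non-STARTED state.
            cursor.execute("select count(*) from sys.shards where state != 'STARTED'")
            if cursor.fetchone()[0] == 0:
                return
        else:
            cursor.execute("select count(*) from sys.shards where state = 'STARTED'")
            if cursor.fetchone()[0] >= expected_active_shards:
                return
        time.sleep(0.5)
    raise TimeoutError(f"shards did not become active within {timeout}s")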

Comment on lines 318 to 319
# account for replication delay; neither wait_for_active_shards nor REFRESH helps here
time.sleep(5)
jeeminso (Contributor Author):
Is there a better option for this?

seut (Member) commented Jun 4, 2025:
Yes, use assert_busy.

jeeminso (Contributor Author):
Thank you!
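
For reference, a minimal sketch of the assert_busy pattern suggested above: a helper that retries an assertion callable until it passes or a timeout expires, and how it could replace the time.sleep(5). Names and signature here are illustrative; the actual helper in crate-qa may differ.

import time

def assert_busy(assertion, timeout=60, interval=0.5):
    # Retry the assertion until it stops raising AssertionError or the
    # timeout expires (illustrative sketch of the pattern).
    deadline = time.time() + timeout
    while True:
        try:
            assertion()
            return
        except AssertionError:
            if time.time() >= deadline:
                raise
            time.sleep(interval)

# Instead of time.sleep(5), retry the replication check until it holds,
# e.g. (expected_rows is whatever the test has inserted so far):
def replicated(cursor, expected_rows):
    cursor.execute("select count(*) from doc.x")
    assert cursor.fetchone()[0] == expected_rows

# assert_busy(lambda: replicated(rc, expected_rows))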

jeeminso force-pushed the jeeminso/lr branch 3 times, most recently from a47df1d to e4b1b28 on June 4, 2025 at 21:30
jeeminso (Contributor Author) commented Jun 5, 2025:

This is unexpected behaviour from a logically replicated shard: select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x'; and select count(*) from doc.x; are out of sync for a noticeable amount of time, which gives me the suspicion that invoking the latter query forces the former to be updated:

cr> select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x';
+---------------+
| sum(num_docs) |
+---------------+
|             1 |
+---------------+
SELECT 1 row in set (0.003 sec)
cr> select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x';
+---------------+
| sum(num_docs) |
+---------------+
|             1 |
+---------------+
SELECT 1 row in set (0.003 sec)
cr> select count(*) from doc.x;
+----------+
| count(*) |
+----------+
|        2 |
+----------+
SELECT 1 row in set (0.684 sec)
cr> select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x';
+---------------+
| sum(num_docs) |
+---------------+
|             2 |
+---------------+
SELECT 1 row in set (0.003 sec)

jeeminso (Contributor Author) commented Jun 5, 2025:

It is intermittent behaviour (I was able to reproduce it, but rarely, on the latest master, replicating from a 1-node cluster to a 1-node cluster): select count(*) from doc.x; does seem to cause select sum(num_docs) from sys.shards where schema_name = 'doc' and table_name = 'x'; to reflect the latest insert.
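
Assuming the assert_busy-style retry sketched earlier, one way to observe whether the two views converge without a fixed sleep would be to poll until sum(num_docs) matches count(*). Purely illustrative, and note the caveat discussed above: running count(*) may itself trigger the refresh.

def num_docs_in_sync(cursor):
    # The two queries from the session above; sys.shards appears to lag
    # until doc.x itself is queried.
    cursor.execute(
        "select sum(num_docs) from sys.shards "
        "where schema_name = 'doc' and table_name = 'x'")
    num_docs = cursor.fetchone()[0]
    cursor.execute("select count(*) from doc.x")
    row_count = cursor.fetchone()[0]
    assert num_docs == row_count, f"num_docs={num_docs}, count={row_count}"

# assert_busy(lambda: num_docs_in_sync(rc))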

c.execute("insert into doc.x values (1)")
rc.execute("insert into doc.rx values (1)")

rc.execute("select count(*) from doc.x")
jeeminso (Contributor Author):

crate.client.exceptions.ProgrammingError: RelationUnknown[Relation 'doc.x' unknown]
io.crate.exceptions.RelationUnknown: Relation 'doc.x' unknown
	at io.crate.exceptions.RelationUnknown.of(RelationUnknown.java:46)

Guessing this means that the DROP stmt succeeded; looking into it.

jeeminso (Contributor Author) commented Jun 5, 2025:

The first commit tests LR during a rolling upgrade 5.10 > jeeminso/temp and the second commit tests 5.10 > branch:master, where the first passes and the second fails, indicating that there is a regression caused by crate/crate#17960.

Hi @seut, could you take a look? BTW, this problem is intermittent, especially when trying to reproduce it manually.

seut (Member) commented Jun 6, 2025:

@jeeminso
Thanks for this info, I'll have a look into this asap.
