fast sync MySQL runs into deadlocks #412

Open
JasonSanDiego opened this issue May 13, 2020 · 8 comments
Labels
bug Something isn't working

Comments

@JasonSanDiego

I am trying to sync MySQL to Redshift and finding that fast sync fails due to deadlocks. We are pointing pipelinewise at a MySQL read replica of a fairly active production DB. Unfortunately, the pipelinewise job fails pretty consistently while trying to fast sync one of our larger tables.

Is there some way to disable fast sync and use traditional singer sync only? I couldn't find any way in the documentation, and I'm now crawling through the code looking for a way to comment it out as a test.

@JasonSanDiego
Author

As in my other issue, I worked around this by manually disabling fast sync in the code to force the traditional Singer sync, which worked fine, presumably because it works in much smaller batches.

It would be nice to have a way to disable fast sync for certain tables (or globally) for situations like this, with a large table on an actively used database.

@koszti
Contributor

koszti commented May 28, 2020

You may be right, but it would also be great to know what's causing the deadlock in Redshift.
FastSync loads tables in parallel, using as many processes as there are CPU cores on the system. This is subject to change: we'd like to introduce for FastSync the parallelism and max_parallelism options that are already available in every PPW target component.

Btw, the Redshift connections are not shared between the FastSync processes; do you think that could cause any issue in Redshift?

Also, do you have a specific error message from Redshift, and do you only see this problem on very active tables?

I'd like to reproduce this problem.
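
For context, here's a minimal sketch of the CPU-count-sized process pool described above, with a stand-in sync_table function and made-up table names (this is not the actual pipelinewise code):

import multiprocessing
from functools import partial


def sync_table(table, args=None):
    # Stand-in for FastSync's per-table export/load; each worker process
    # opens its own source and target connections.
    print(f'syncing {table}')


if __name__ == '__main__':
    tables = ['mydb.orders', 'mydb.customers', 'mydb.events']

    # One worker per CPU core: on a 16-core machine this means up to 16
    # concurrent dumps running against the source replica at once.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.map(partial(sync_table, args=None), tables)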

@JasonSanDiego
Author

JasonSanDiego commented May 28, 2020 via email

@louis-pie added the bug label on Aug 19, 2020
@danielerapati

Hi,
we had a very similar problem with random deadlocks on the source systems (different Postgres read replicas on RDS).
We worked around it by deactivating multiprocessing here: https://github.com/transferwise/pipelinewise/blob/master/pipelinewise/fastsync/postgres_to_redshift.py#L172 (there is a tradeoff: the tap runs sequentially and is much slower, as if it had only one CPU available).
What I think is happening is that the source Postgres (or MySQL; that multiprocessing call seems to be in every fastsync tap) is configured in such a way that multiple simultaneous pipelinewise connections trip over each other.
This problem seems to be specific to fastsync, but I could not work out what specific DB operation fastsync's sync_table() performs that causes it to acquire an AccessExclusiveLock on a database resource it shares with other sync_table() calls.
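
For anyone trying the same workaround, a rough sketch of the change (paraphrased, not the exact code at the linked line): replace the pool map with a plain loop, so only one table is synced, and only one source connection is busy, at a time.

def sync_table(table, args=None):
    # Stand-in for the per-table fastsync routine.
    print(f'syncing {table}')


def sync_tables(tables, args=None):
    # Default behaviour (roughly): multiprocessing.Pool(cpu_count()) mapping
    # sync_table over all tables in parallel.
    # Workaround: a plain loop, one table and one connection at a time.
    for table in tables:
        sync_table(table, args)


if __name__ == '__main__':
    sync_tables(['public.orders', 'public.customers'])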

@koszti
Contributor

koszti commented Jan 21, 2021

The FastSync parallelism level is now configurable via a new fastsync_parallelism parameter. Example YAML is here. It's currently available in the master branch and will be released soon as part of PPW 0.30.0.
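
A hedged sketch of where the new parameter sits in a tap YAML (connection values are placeholders and the rest of the tap config is omitted; check the linked example for the authoritative layout):

---
id: "mysql_replica"
name: "MySQL read replica"
type: "tap-mysql"
target: "redshift"

# Cap FastSync at 2 parallel table-sync processes instead of one per CPU core
fastsync_parallelism: 2

db_conn:
  host: "replica.example.com"
  port: 3306
  user: "pipelinewise"
  password: "<secret>"
  dbname: "mydb"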

The tradeoff mentioned by @danielerapati is valid: performance drops as the fastsync parallelism level is lowered.

It would be nice if we could reproduce the problem somehow to see what's causing the deadlock in the source DB.

@ers81239

@JasonSanDiego , can you share with me your code change to disable fastsync? I'm experiencing a similar issue and managed to 'fool' pipelinewise into not fastsyncing most of my tables by creating bookmarks in state.json. But for some reason it still attempts fast sync for a few tables.
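
For reference, the bookmark trick above relies on Singer-style state; a minimal sketch of a state.json entry for one INCREMENTAL table could look like this (the stream id and bookmark keys are assumptions and vary by tap and replication method):

{
  "bookmarks": {
    "mydb-large_table": {
      "replication_key": "updated_at",
      "replication_key_value": "2020-05-01T00:00:00+00:00"
    }
  }
}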

@ers81239

For anyone else encountering this issue, I found a way to disable fastsync:

In pipelinewise.py, in the run_tap function, find this line (line 1003 at this time):

if len(fastsync_stream_ids) > 0:

And change it to:

if 1 == 0:
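
In context (paraphrased; line numbers and surrounding code differ between pipelinewise versions), the edit simply makes the FastSync branch unreachable, so every stream falls through to the traditional Singer sync:

# pipelinewise.py, run_tap(): streams without a saved bookmark are normally
# routed to FastSync first. A condition that can never be true skips that
# branch entirely.
fastsync_stream_ids = ['mydb-large_table']  # illustrative value

if 1 == 0:  # was: if len(fastsync_stream_ids) > 0:
    print('running FastSync for', fastsync_stream_ids)  # never reached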

@mhindery
Contributor

As we have also run into issues with fastsync, I have started a PR to allow turning off fastsync without having to edit and install custom source code, so this should become easier in the future: #697
