fast sync MySQL runs into deadlocks #412

Open
JasonSanDiego opened this issue May 13, 2020 · 8 comments
Labels
bug Something isn't working

Comments

@JasonSanDiego

I am trying to sync MySQL to Redshift and finding that fast sync fails due to deadlocks. We are pointing pipelinewise at a MySQL read replica of a fairly active production DB. Unfortunately, the pipelinewise job fails pretty consistently while trying to fast sync one of our larger tables.

Is there some way to disable fast sync and use traditional singer sync only? I couldn't find any way in the documentation, and I'm now crawling through the code looking for a way to comment it out as a test.

@JasonSanDiego
Author

As in my other issue, I worked around this by manually disabling fast sync in the code to force the traditional Singer sync, which worked fine, presumably because it works in much smaller batches.

It would be nice to have a way to disable fast sync for certain tables (or globally) for situations like this, with a large table on an actively used database.

@koszti
Contributor

koszti commented May 28, 2020

You may be right, but it would also be great to know what's causing the deadlock in Redshift.
FastSync loads tables in parallel, using as many processes as there are CPU cores on the system. This is subject to change: we'd like to introduce for FastSync the parallelism and max_parallelism options that are already available in every PPW target component.

Btw, the Redshift connections are not shared between the FastSync processes; do you think that could cause any issue in Redshift?

Also, do you have a specific error message from Redshift, and do you only see this problem on very active tables?

I'd like to reproduce this problem.
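
For context, here's a minimal sketch of the CPU-count-sized process pool described above, with a stand-in sync_table function and made-up table names (this is not the actual pipelinewise code):

import multiprocessing
from functools import partial


def sync_table(table, args=None):
    # Stand-in for FastSync's per-table export/load; each worker process
    # opens its own source and target connections.
    print(f'syncing {table}')


if __name__ == '__main__':
    tables = ['mydb.orders', 'mydb.customers', 'mydb.events']

    # One worker per CPU core: on a 16-core machine this means up to 16
    # concurrent dumps running against the source replica at once.
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.map(partial(sync_table, args=None), tables)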

@JasonSanDiego
Author

JasonSanDiego commented May 28, 2020 via email

@louis-pie added the bug label on Aug 19, 2020
@danielerapati

Hi,
we had a very similar problem with random deadlocks on the source systems (different Postgres read replicas on RDS).
We worked around it by deactivating multiprocessing here: https://github.com/transferwise/pipelinewise/blob/master/pipelinewise/fastsync/postgres_to_redshift.py#L172 (there is a tradeoff: the tap runs sequentially and is much slower, as if it had only one CPU available).
What I think is happening is that the source Postgres (or MySQL; that multiprocessing call seems to be in every fastsync tap) is configured in such a way that multiple simultaneous pipelinewise connections trip over each other.
This problem seems to be specific to fastsync, but I could not work out what specific DB operation fastsync's sync_table() performs that causes it to acquire an AccessExclusiveLock on a database resource it shares with other sync_table() calls.
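
For anyone trying the same workaround, a rough sketch of the change (paraphrased, not the exact code at the linked line): replace the pool map with a plain loop, so only one table is synced, and only one source connection is busy, at a time.

def sync_table(table, args=None):
    # Stand-in for the per-table fastsync routine.
    print(f'syncing {table}')


def sync_tables(tables, args=None):
    # Default behaviour (roughly): multiprocessing.Pool(cpu_count()) mapping
    # sync_table over all tables in parallel.
    # Workaround: a plain loop, one table and one connection at a time.
    for table in tables:
        sync_table(table, args)


if __name__ == '__main__':
    sync_tables(['public.orders', 'public.customers'])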

@koszti
Contributor

koszti commented Jan 21, 2021

The FastSync parallelism level is now configurable via a new fastsync_parallelism parameter. Example YAML is here. It's currently available in the master branch and will be released soon as part of PPW 0.30.0.
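
A hedged sketch of where the new parameter sits in a tap YAML (connection values are placeholders and the rest of the tap config is omitted; check the linked example for the authoritative layout):

---
id: "mysql_replica"
name: "MySQL read replica"
type: "tap-mysql"
target: "redshift"

# Cap FastSync at 2 parallel table-sync processes instead of one per CPU core
fastsync_parallelism: 2

db_conn:
  host: "replica.example.com"
  port: 3306
  user: "pipelinewise"
  password: "<secret>"
  dbname: "mydb"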

The tradeoff mentioned by @danielerapati is valid: performance drops as the fastsync parallelism level is lowered.

It would be nice if we could reproduce the problem somehow to see what's causing the deadlock in the source DB.

@ers81239

@JasonSanDiego , can you share with me your code change to disable fastsync? I'm experiencing a similar issue and managed to 'fool' pipelinewise into not fastsyncing most of my tables by creating bookmarks in state.json. But for some reason it still attempts fast sync for a few tables.
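
For reference, the bookmark trick above relies on Singer-style state; a minimal sketch of a state.json entry for one INCREMENTAL table could look like this (the stream id and bookmark keys are assumptions and vary by tap and replication method):

{
  "bookmarks": {
    "mydb-large_table": {
      "replication_key": "updated_at",
      "replication_key_value": "2020-05-01T00:00:00+00:00"
    }
  }
}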

@ers81239

For anyone else encountering this issue, I found a way to disable fastsync:

In pipelinewise.py, in the run_tap function, find this line (line 1003 at this time):

if len(fastsync_stream_ids) > 0:

And change it to:

if 1 == 0:
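
In context (paraphrased; line numbers and surrounding code differ between pipelinewise versions), the edit simply makes the FastSync branch unreachable, so every stream falls through to the traditional Singer sync:

# pipelinewise.py, run_tap(): streams without a saved bookmark are normally
# routed to FastSync first. A condition that can never be true skips that
# branch entirely.
fastsync_stream_ids = ['mydb-large_table']  # illustrative value

if 1 == 0:  # was: if len(fastsync_stream_ids) > 0:
    print('running FastSync for', fastsync_stream_ids)  # never reached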

@mhindery
Contributor

As we have also run into issues with fastsync, I have started a PR to allow turning off fastsync without having to edit and install custom source code, so this should become easier in the future: #697
