Saving a (re-usable) form definition with many components is unreasonably slow #5084
sergei-maertens
added a commit
that referenced
this issue
Feb 5, 2025
The UPSERT query when sending all of the desired form variables leads to hefty queries (250ms for 1100 variables, 1100ms for ~7K), but more importantly, it results in large wall clock durations, presumably because of the data going over the wire and Django processing the results of the bulk_create, which we aren't interested in. There doesn't appear to be a way to instruct Django to ignore whatever happens afterwards, so instead we perform the diffing on our end to only include the records that need to be created or updated.
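The diffing approach described above can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: `VariableState` and its fields are hypothetical stand-ins for a `FormVariable` row, and the real implementation compares model instances rather than dataclasses.

```python
from dataclasses import dataclass


@dataclass
class VariableState:
    """Hypothetical stand-in for one FormVariable row."""

    key: str
    data_type: str


def diff_variables(
    existing: dict[str, VariableState],
    desired: dict[str, VariableState],
) -> tuple[list[VariableState], list[VariableState], list[str]]:
    """Compute the minimal set of changes instead of upserting everything.

    Only records that are new or actually changed are sent to the
    database; unchanged records are skipped entirely.
    """
    to_create = [var for key, var in desired.items() if key not in existing]
    to_update = [
        var
        for key, var in desired.items()
        if key in existing and existing[key] != var
    ]
    to_delete = [key for key in existing if key not in desired]
    return to_create, to_update, to_delete
```

With a diff like this, an unchanged form definition results in three empty lists and no write queries at all, instead of a 250ms+ UPSERT.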
sergei-maertens
added a commit
that referenced
this issue
Feb 5, 2025
When processing 1000 form variables (on my machine), the wall time was on the order of 10-20s, which is entirely unacceptable.
sergei-maertens
added a commit
that referenced
this issue
Feb 5, 2025
We noticed 3min of wall clock time when updating a form definition that results in 7000 form variables being updated (local dev environment), which is completely unacceptable. Isolating the culprit/hot code path revealed that about 10-12 seconds was being spent for 1000 form variables, so a conservative time limit of 5 seconds is set up in tests to catch future performance regressions. Realistically this should take less than a second, but nobody likes flaky tests, so we pick something in the middle to account for different hardware in CI. Updates with small diffs should take substantially less time still, as we can skip work that doesn't need to be done and avoid creating instances in the first place.
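A performance-regression guard like the one described can be sketched as below. This is a hedged illustration: `create_form_variables` is a hypothetical stand-in for the hot code path (the real code derives FormVariable rows from a formio definition inside a Django test case), but the timing pattern is the same.

```python
import time


def create_form_variables(count: int) -> list[dict]:
    """Hypothetical stand-in for the hot code path being guarded."""
    return [
        {"key": f"variable-{i}", "source": "component"}
        for i in range(count)
    ]


def assert_within_time_limit(func, limit_seconds: float = 5.0):
    """Fail when the wrapped call exceeds the conservative time budget.

    The budget is deliberately well above the realistic sub-second
    target so slower CI hardware doesn't produce flaky failures.
    """
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    assert elapsed < limit_seconds, (
        f"took {elapsed:.2f}s (limit: {limit_seconds}s)"
    )
    return result


variables = assert_within_time_limit(lambda: create_form_variables(1000))
```

The point of the 5-second middle ground is exactly this trade-off: tight enough to catch a 10-20s regression, loose enough to tolerate CI hardware variance.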
sergei-maertens
added a commit
that referenced
this issue
Feb 5, 2025
There are a number of aspects to this performance patch.

1. Avoid database load

   The UPSERT query when sending all of the desired form variables leads to hefty queries (250ms for 1100 variables, 1100ms for ~7K). Splitting this up into work that can be ignored brings the query duration down to about 50-60ms for 1000 variables.

2. Avoid expensive formio definition processing

   By far the most expensive operation is scanning whether a component is inside an editgrid ("repeating group") through the utility component_in_editgrid, which was parsing the same configuration over and over again. Instead, we can use the already-existing flag of iter_components to not recurse into edit grids in the first place. This fixes most of the time spent in Python.

3. Replace deepcopy with shallow copy

   This is probably the most controversial one - when deepcopying a Django model instance, it goes through all the motions of serializing it for pickling, which means it must figure out the reconstruction function to use and capture all the necessary data. deepcopy also recurses, so it goes into the related form_definition and thus the formio configuration object, which is full of lists/dicts that are expensive to deep copy. The replacement with a shallow copy should be fine because:

   * we're not copying any existing database instances (pk=None)
   * all the kwargs in the initial instantiation are calculated and provided explicitly; there is no reliance on mutation
   * when 'copying' the desired variables for each form, we assign the optimized form_id attribute and don't do any mutations, i.e. all operations remain at the shallow level

   This is covered with some tests to hopefully prevent future regressions.

Other ideas considered:

* don't store FormVariables in the database, but instead create them in memory on the fly. This will be a problem once we no longer store prefill configuration in the component: we will require actual DB instances. It's also not very intuitive.
* offload to celery again. This is what we initially patched out, as it was leading to race conditions, deadlocks and general performance issues too. It may also strangely affect existing submissions. Given the complexity and the performance gains already achieved, this was not further explored.

On my local machine, this brings the worst case insert in the test (1000 form variables from a form definition with 1000 components) from 10+ seconds down to 400ms, so about a factor 25 improvement.
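Why the shallow copy is both cheap and safe here can be demonstrated in isolation. The sketch below uses a hypothetical `FakeModel` in place of the real Django model: like an unsaved instance, it has `pk=None` and references a nested formio configuration. `copy.copy` shares that nested structure instead of recursing into it, and assigning a new `form_id` on the copy never touches the original.

```python
import copy


class FakeModel:
    """Hypothetical stand-in for an unsaved Django model instance:
    pk is None and it references a large, nested configuration."""

    def __init__(self, pk, form_id, configuration):
        self.pk = pk
        self.form_id = form_id
        self.configuration = configuration


template = FakeModel(
    pk=None,
    form_id=None,
    configuration={"components": [{"key": "field1", "type": "textfield"}]},
)

# Shallow copy: no pickling machinery, no recursion into the
# configuration dict - just a new object with the same attribute refs.
clone = copy.copy(template)
clone.form_id = 42  # attribute assignment stays at the shallow level

assert clone.configuration is template.configuration  # shared, not deep-copied
assert template.form_id is None  # original untouched
```

The safety argument from the bullet points maps directly onto this: because nothing ever mutates `configuration` through the clone, sharing it is harmless, while `copy.deepcopy(template)` would have walked the entire nested structure on every copy.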
sergei-maertens
added a commit
that referenced
this issue
Feb 6, 2025
There are a number of aspects to this performance patch.

1. Avoid database load

   The UPSERT query when sending all of the desired form variables leads to hefty queries (250ms for 1100 variables, 1100ms for ~7K). Splitting this up into work that can be ignored brings the query duration down to about 50-60ms for 1000 variables.

2. Avoid expensive formio definition processing

   By far the most expensive operation is scanning whether a component is inside an editgrid ("repeating group") through the utility component_in_editgrid, which was parsing the same configuration over and over again. Instead, we can use the already-existing flag of iter_components to not recurse into edit grids in the first place. This fixes most of the time spent in Python.

3. Replace deepcopy with shallow copy

   This is probably the most controversial one - when deepcopying a Django model instance, it goes through all the motions of serializing it for pickling, which means it must figure out the reconstruction function to use and capture all the necessary data. deepcopy also recurses, so it goes into the related form_definition and thus the formio configuration object, which is full of lists/dicts that are expensive to deep copy. The replacement with a shallow copy should be fine because:

   * we're not copying any existing database instances (pk=None)
   * all the kwargs in the initial instantiation are calculated and provided explicitly; there is no reliance on mutation
   * when 'copying' the desired variables for each form, we assign the optimized form_id attribute and don't do any mutations, i.e. all operations remain at the shallow level

   This is covered with some tests to hopefully prevent future regressions.

Other ideas considered:

* don't store FormVariables in the database, but instead create them in memory on the fly. This will be a problem once we no longer store prefill configuration in the component: we will require actual DB instances. It's also not very intuitive.
* offload to celery again. This is what we initially patched out, as it was leading to race conditions, deadlocks and general performance issues too. It may also strangely affect existing submissions. Given the complexity and the performance gains already achieved, this was not further explored.

On my local machine, this brings the worst case insert in the test (1000 form variables from a form definition with 1000 components) from 10+ seconds down to 400ms, so about a factor 25 improvement.

Backport-of: #5085
Labels
Product versie / Product version
3.0.2 / 2.8.4 / latest
Customer reference
DB test env
Omschrijf het probleem / Describe the bug
If you have a form with one or more form definitions with many components (observed: 95 components), the PUT call in the API to update it is very slow (tens of seconds to minutes), which makes editing forms or form definitions a nightmare.
Stappen om te reproduceren / Steps to reproduce
No response
Verwacht gedrag / Expected behavior
It's fast ⚡.