Reduce lock contention #5

Open · JayKickliter wants to merge 5 commits into master from jsk/reduce-lock-contention
Conversation

@JayKickliter commented Jul 1, 2022

I put no effort into optimizing the synchronization of multiple caches when first rewriting the native portion of this module. Currently, all individual caches are stored in a single global map synchronized with a mutex. That mutex is locked for as long as any NIF is accessing any single cache, thus blocking any other NIFs that want to access any other, unrelated cache.

This PR attempts to reduce global lock contention by switching from a mutex to a read-write lock. The idea is that new caches will be created infrequently, and most lookups of individual caches in the global cache map will be read-only. Furthermore, each cache in the map is wrapped in an atomically reference-counted pointer (Arc) and a read-write lock (RwLock). This allows us to hold the global lock only as long as needed to clone the target cache's Arc. After that, non-mutating access to a cache acquires only that cache's read lock, allowing other NIF calls concurrent read access to the same cache.
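
For illustration, a minimal sketch of the locking scheme described above, using std's OnceLock for lazy initialization of the global (the PR itself may initialize CACHES differently) and a stubbed-out TermCache:

use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

// Stand-ins for the PR's real key and cache types.
type Key = Vec<u8>;
struct TermCache;

// Global map of caches: an RwLock so concurrent lookups don't serialize,
// with each cache behind its own Arc<RwLock<..>>.
fn caches() -> &'static RwLock<HashMap<Key, Arc<RwLock<TermCache>>>> {
    static CACHES: OnceLock<RwLock<HashMap<Key, Arc<RwLock<TermCache>>>>> = OnceLock::new();
    CACHES.get_or_init(|| RwLock::new(HashMap::new()))
}

// Hold the global read lock only long enough to clone the Arc;
// subsequent access to the cache itself takes that cache's own lock.
fn get_cache(name: &Key) -> Option<Arc<RwLock<TermCache>>> {
    caches().read().unwrap().get(name).cloned()
}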

Outstanding work

This code will not compile as-is because the OwnedBinarys used for both the global map keys and cache keys are not Sync. I need to either (A) get upstream rustler to implement Sync for OwnedBinary, or (B) switch to using Vec<u8> for key-value pairs. (B) may add data-copying overhead, but I'm a little fuzzy on the cost of converting between ErlNifBinary and Binary terms.
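
As a sketch of option (B), assuming rustler's Binary type, which exposes its bytes as a slice; the helper name is hypothetical:

use rustler::Binary;

// Copy the NIF binary's bytes into an owned Vec<u8>, which is
// Send + Sync and can safely be stored in the global map.
// Costs one memcpy per key on the way in.
fn key_from_binary(bin: Binary) -> Vec<u8> {
    bin.as_slice().to_vec()
}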

@JayKickliter force-pushed the jsk/reduce-lock-contention branch from 725cd5c to f841b5c on July 1, 2022 17:59
@JayKickliter requested review from madninja, xandkar, and dpezely and removed the requests for madninja and xandkar on July 1, 2022 18:00
@JayKickliter force-pushed the jsk/reduce-lock-contention branch from f841b5c to 0eacfce on July 1, 2022 18:04
native/lib.rs (review thread, resolved)
@Vagabond commented Jul 1, 2022

Keeping binaries as binary references is probably important

@Vagabond commented Jul 1, 2022

Having each cache be its own Resource that we could track the reference to, and not having the NIF track the list of Resources would also be fine, FWIW. We could track the name -> Resource mapping in the Erlang part of the NIF to retain API compat.
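
For illustration, a rough sketch of this per-cache Resource idea, assuming rustler's ResourceArc API and an older-style init! macro; the module name, the NIF, and the stub TermCache are all hypothetical:

use std::collections::HashMap;
use std::sync::Mutex;

use rustler::{Env, ResourceArc, Term};

// Stand-in for the real cache type.
struct TermCache(HashMap<Vec<u8>, Vec<u8>>);

// Each cache is its own NIF resource with only a per-cache lock;
// the Rust side no longer needs a global name -> cache map.
struct CacheResource(Mutex<TermCache>);

#[rustler::nif]
fn create() -> ResourceArc<CacheResource> {
    ResourceArc::new(CacheResource(Mutex::new(TermCache(HashMap::new()))))
}

fn load(env: Env, _info: Term) -> bool {
    rustler::resource!(CacheResource, env);
    true
}

rustler::init!("e2qc_nif", [create], load = load);

The Erlang side would then keep the name-to-resource mapping (e.g. in ETS or persistent_term) to retain API compatibility, as suggested above.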

@JayKickliter (Author) commented Jul 1, 2022

Having each cache be its own Resource that we could track the reference to, and not having the NIF track the list of Resources would also be fine, FWIW.

Yes, that would be ideal.

EDIT: on second thought, I'm not sure if doing a 'cache_name_atom' to cache lookup in ETS would be any faster than the current scheme.

@JayKickliter requested a review from marcsugiyama on July 5, 2022 23:41
@marcsugiyama left a comment

With the caveat that I don't know Rust... The concept behind the changes seems sound to me. My only concern was possible exclusive-lock starvation, but it looks like the RwLock implementation favors write locks to avoid this (once a write lock is requested, no new read locks are granted until after the write lock is granted).

native/lib.rs (outdated)
.expect("MinQ1Size/MaxSize ratio doesn't appear to make sense");
let cache = Arc::new(RwLock::new(TermCache::new(inner)));
CACHES
.write()

@marcsugiyama:

Is this trying to get a write lock on CACHES without releasing the read lock?

@marcsugiyama:

This code looks like it's creating a new cache for put if one doesn't exist. The code here differs slightly from the code for create. Can the cache creation be moved to a utility function that both create and put use?

@JayKickliter (Author):

Is this trying to get a write lock on CACHES without releasing the read lock?

It may not be obvious, but the read lock acquired there is a temporary and is dropped at the end of line 65. Elsewhere in the code, I explicitly drop read locks instead of waiting for them to be dropped at the end of scope:

 drop(some_lock);

@JayKickliter (Author):

Can the cache creation be moved to a utility function that both create and put use?

Probably. Maybe there's a valid reason I didn't do that, but it's likely an oversight.

@marcsugiyama:

It may not be obvious, but the read lock acquired there is a temporary and is dropped at the end of line 65

Got it. The lock gets dropped when the value goes out of scope. This is my non-knowledge of Rust getting in the way...

@JayKickliter (Author):

Follow-up: Marc's intuition is correct, and there is in fact a deadlock here. It is fixed in a recent commit.

@marcsugiyama:

Having each cache be its own Resource that we could track the reference to, and not having the NIF track the list of Resources would also be fine, FWIW.

Yes, that would be ideal.

EDIT: on second thought, I'm not sure if doing a 'cache_name_atom' to cache lookup in ETS would be any faster than the current scheme.

You could use persistent_term for the mapping, plus some code on the Erlang side to map the atom cache name to a NIF reference to pass to the Rust NIF. If the value is a NIF reference, then there's no global GC when the key is deleted (say, to delete the cache). This would require changing the Rust code to use a NIF reference to point to a particular cache. That way, we don't need a hashmap in the Rust code to dereference the atom identifier for the cache.

Drop live readlocks before creating writelocks.
let cache_name = HBin::from_atom(env, cache_name);
let caches = CACHES.read().unwrap();
if caches.get(&cache_name).is_some() {

@marcsugiyama:

I think grabbing the read lock, dropping the lock, and then acquiring the write lock introduces a race condition where two threads can try to insert the same cache at the same time. There's probably no real consequence to that for create. Maybe use an atomic read-to-write upgradable lock, or start with a write lock.

// deadlock when creating the upcoming write lock.
drop(caches);
let mut caches = CACHES.write().unwrap();
caches.insert(HBin::from_atom(env, cache_name), cache.clone());

@marcsugiyama:

I think there's the possibility of a race condition here: put first discovers there is no cache under the read lock and drops the lock, then another thread in put or create adds the cache. The insert here then replaces the previous cache, possibly causing the loss of a cached item. Since the cache is an optimization, this isn't a correctness problem, but it could be a minor performance problem.

@marcsugiyama:

As in create, an upgradable lock or starting with a write lock might be better.
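
For illustration, a sketch of the upgradable-lock variant; std's RwLock has no upgrade operation, so this assumes the parking_lot crate, and the names mirror the PR's scheme only loosely:

use std::collections::HashMap;
use std::sync::Arc;

use parking_lot::{RwLock, RwLockUpgradableReadGuard};

type Key = Vec<u8>;
struct TermCache;

// Check-then-insert without the read -> drop -> write gap: the upgradable
// guard coexists with plain readers but excludes writers and other
// upgradable readers, and upgrades to a write guard atomically.
fn get_or_create(
    caches: &RwLock<HashMap<Key, Arc<RwLock<TermCache>>>>,
    name: &Key,
) -> Arc<RwLock<TermCache>> {
    let guard = caches.upgradable_read();
    if let Some(cache) = guard.get(name) {
        return cache.clone();
    }
    let mut map = RwLockUpgradableReadGuard::upgrade(guard);
    let cache = Arc::new(RwLock::new(TermCache));
    map.insert(name.clone(), cache.clone());
    cache
}

This closes both races noted above: two threads can no longer insert the same cache, and put can no longer clobber a cache inserted between its read and write phases.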

@marcsugiyama:

The PR improves concurrency with respect to the number of caches:
master branch:

type CacheCount Workers/cache Iterations/worker cache ops/ms
slow          1            1              100k       539.2
slow          1            1              100k       563.3
slow          1            1              100k       570.7
slow          1            2              100k       875.0
slow          1            2              100k       848.5
slow          1            2              100k       861.5
slow          1            4              100k       929.6
slow          1            4              100k       949.4
slow          1            4              100k       958.8
slow          1            8              100k       644.7
slow          1            8              100k       751.8
slow          1            8              100k       708.3
slow          1           16              100k       588.1
slow          1           16              100k       594.8
slow          1           16              100k       586.1
slow          2            1              100k       839.2
slow          2            1              100k       845.2
slow          2            1              100k       832.9
slow          2            2              100k       934.5
slow          2            2              100k       935.8
slow          2            2              100k       932.7
slow          2            4              100k       591.2
slow          2            4              100k       580.0
slow          2            4              100k       804.2
slow          2            8              100k       586.1
slow          2            8              100k       555.8
slow          2            8              100k       556.9
slow          2           16              100k       731.9
slow          2           16              100k       889.7
slow          2           16              100k       985.5
slow          4            1              100k       918.3
slow          4            1              100k       901.6
slow          4            1              100k       712.2
slow          4            2              100k       691.8
slow          4            2              100k       628.2
slow          4            2              100k       595.8
slow          4            4              100k       692.1
slow          4            4              100k       551.3
slow          4            4              100k       528.3
slow          4            8              100k       785.4
slow          4            8              100k       764.7
slow          4            8              100k       649.7
slow          4           16              100k       875.6
slow          4           16              100k       643.9
slow          4           16              100k       762.4

jsk/reduce-lock-contention branch:

type CacheCount Workers/cache Iterations/worker cache ops/ms
slow          1            1              100k       526.9
slow          1            1              100k       536.2
slow          1            1              100k       532.9
slow          1            2              100k       872.1
slow          1            2              100k       857.8
slow          1            2              100k       858.7
slow          1            4              100k       374.9
slow          1            4              100k       368.1
slow          1            4              100k       356.6
slow          1            8              100k       371.2
slow          1            8              100k       371.3
slow          1            8              100k       372.2
slow          1           16              100k       366.4
slow          1           16              100k       365.3
slow          1           16              100k       365.0
slow          2            1              100k      1042.4
slow          2            1              100k      1047.1
slow          2            1              100k      1059.4
slow          2            2              100k      1442.6
slow          2            2              100k      1376.5
slow          2            2              100k      1262.9
slow          2            4              100k       647.7
slow          2            4              100k       657.0
slow          2            4              100k       658.5
slow          2            8              100k       675.9
slow          2            8              100k       666.3
slow          2            8              100k       670.4
slow          2           16              100k       690.4
slow          2           16              100k       676.8
slow          2           16              100k       678.0
slow          4            1              100k      1864.7
slow          4            1              100k      1953.4
slow          4            1              100k      1889.9
slow          4            2              100k      1036.3
slow          4            2              100k      1169.7
slow          4            2              100k      1150.9
slow          4            4              100k      1048.7
slow          4            4              100k      1023.5
slow          4            4              100k       990.3
slow          4            8              100k      1070.8
slow          4            8              100k      1003.9
slow          4            8              100k       956.6
slow          4           16              100k       948.1
slow          4           16              100k       915.0
slow          4           16              100k       995.3

However, the branch's scalability with respect to workers per cache is poorer than master's with 4 or more cores.

@marcsugiyama:

The benchmark test is in the sugiyama/concurrency-test branch.

@marcsugiyama:

A prototype that uses a mutex but yields if the lock is taken scales pretty well up to around four concurrent caches/workers. The prototype code is in the sugiyama/concurrent-test branch.

type CacheCount Workers/cache Iterations/worker cache ops/ms
slow          1            1              100k       563.5
slow          1            1              100k       556.7
slow          1            1              100k       566.8
slow          1            2              100k      1117.0
slow          1            2              100k      1101.4
slow          1            2              100k      1087.1
slow          1            4              100k      1818.0
slow          1            4              100k      1890.2
slow          1            4              100k      2001.1
slow          1            8              100k      1542.5
slow          1            8              100k      1531.3
slow          1            8              100k      1582.0
slow          1           16              100k      1274.8
slow          1           16              100k      1282.7
slow          1           16              100k      1303.3
slow          2            1              100k      1060.9
slow          2            1              100k      1042.5
slow          2            1              100k      1057.3
slow          2            2              100k      1750.8
slow          2            2              100k      1699.1
slow          2            2              100k      1763.5
slow          2            4              100k      1457.7
slow          2            4              100k      1478.6
slow          2            4              100k      1447.0
slow          2            8              100k      1227.2
slow          2            8              100k      1231.8
slow          2            8              100k      1226.1
slow          2           16              100k      1196.6
slow          2           16              100k      1185.2
slow          2           16              100k      1184.8
slow          4            1              100k      1769.4
slow          4            1              100k      1809.3
slow          4            1              100k      1879.4
slow          4            2              100k      1467.5
slow          4            2              100k      1502.1
slow          4            2              100k      1486.5
slow          4            4              100k      1241.0
slow          4            4              100k      1220.7
slow          4            4              100k      1208.7
slow          4            8              100k      1183.2
slow          4            8              100k      1195.3
slow          4            8              100k      1171.8
slow          4           16              100k      1113.3
slow          4           16              100k      1117.6
slow          4           16              100k      1092.9

@marcsugiyama:

A prototype that uses a Mutex instead of an RwLock on each cache (not on the cache map) improves the performance of this PR:

type CacheCount Workers/cache Iterations/worker cache ops/ms
slow          1            1              100k       519.0
slow          1            1              100k       539.6
slow          1            1              100k       552.1
slow          1            2              100k      1008.3
slow          1            2              100k      1011.0
slow          1            2              100k       991.8
slow          1            4              100k      1343.3
slow          1            4              100k      1373.2
slow          1            4              100k      1401.4
slow          1            8              100k       757.1
slow          1            8              100k       760.3
slow          1            8              100k       788.3
slow          1           16              100k       722.8
slow          1           16              100k       723.5
slow          1           16              100k       722.8
slow          2            1              100k      1043.2
slow          2            1              100k      1044.1
slow          2            1              100k      1058.2
slow          2            2              100k      1804.1
slow          2            2              100k      1769.2
slow          2            2              100k      1846.2
slow          2            4              100k      1344.9
slow          2            4              100k      1408.7
slow          2            4              100k      1360.1
slow          2            8              100k      1227.9
slow          2            8              100k      1240.0
slow          2            8              100k      1264.6
slow          2           16              100k      1200.8
slow          2           16              100k      1178.5
slow          2           16              100k      1177.6
slow          4            1              100k      1941.1
slow          4            1              100k      1915.2
slow          4            1              100k      1823.8
slow          4            2              100k      2273.9
slow          4            2              100k      2237.7
slow          4            2              100k      2251.8
slow          4            4              100k      1970.8
slow          4            4              100k      1936.1
slow          4            4              100k      1889.4
slow          4            8              100k      1755.1
slow          4            8              100k      1800.8
slow          4            8              100k      1793.7
slow          4           16              100k      1643.4
slow          4           16              100k      1703.1
slow          4           16              100k      1712.0

@marcsugiyama:

Accidentally closed the pull request. Reopening.

@marcsugiyama reopened this on Jul 26, 2022
@marcsugiyama:

Further prototypes to improve performance:

  1. On the cache-id hashmap, use an RwLock and yield if the lock is taken.
  2. On the cache lock, use a Mutex and yield if the lock is taken.

Even with these changes, it's not clear whether the scalability of e2qc will match cream under high load. cream has no locks for looking up the cache id and is based on the highly concurrent moka cache. Using an RwLock for the cache-id lookup in e2qc likely eliminates blocking on that lookup, since write locks are only needed when a cache is created or destroyed, and caches are likely created during system initialization. An API change to e2qc could eliminate the cache-id lookup lock entirely, but we'd still be left with an exclusive lock on the cache itself under most circumstances. Yielding when the cache is locked avoids blocking in Rust, but we pay for a round trip through Erlang scheduling.
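
For illustration, a minimal sketch of the yield-on-contention idea on the Rust side (all names hypothetical): a contended try_lock returns a retry signal instead of blocking the scheduler thread, and the Erlang wrapper yields before calling the NIF again:

use std::collections::HashMap;
use std::sync::{Mutex, TryLockError};

// What the NIF would translate into Erlang terms:
// a hit, a miss, or a `retry` atom that tells the caller to yield.
enum LookupResult {
    Hit(Vec<u8>),
    Miss,
    Retry,
}

fn try_get(cache: &Mutex<HashMap<Vec<u8>, Vec<u8>>>, key: &[u8]) -> LookupResult {
    match cache.try_lock() {
        Ok(map) => match map.get(key) {
            Some(v) => LookupResult::Hit(v.clone()),
            None => LookupResult::Miss,
        },
        // Lock is contended: don't block; let the Erlang side yield and retry.
        Err(TryLockError::WouldBlock) => LookupResult::Retry,
        // A panicked holder poisoned the lock; treat it as a miss here.
        Err(TryLockError::Poisoned(_)) => LookupResult::Miss,
    }
}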

Whether cream is better than e2qc depends on the use case. miner seems to have several different caches, which makes cream better than the version of e2qc with a single mutex for all caches. It's not clear how many workers there are per cache. cream's scalability with respect to the number of workers per cache seems better than e2qc's.

Depending on the usage pattern, 2Q may not be the right choice anyway. 2Q is designed for a specific relational-database pattern: balancing the caching of small tables and index pages against churning through buffers during a table scan. e2qc needs an exclusive lock for most cache operations because the LRU chain is updated. cream uses moka, which implements synchronization within the library, so its behavior is not obvious; moka's documentation suggests it behaves like a traditional LRU cache.
