-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 6553 - Update concread to 0.5.4 and refactor statistics tracking #6607
Conversation
@Firstyear hi! Could you please take a look at the PR, as I'm not sure how to get: I've tried to Also, note that I created |
src/librslapd/src/cache.rs
Outdated
unsafe { | ||
let _drop = Box::from_raw(read_txn); | ||
let read_txn_box = Box::from_raw(read_txn); | ||
read_txn_box.cache.inner.try_quiesce(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to do try_quiesce_stats(); and you need to do it after you call .finish.
I think you may be misunderstanding how the stats api is working. The stats struct you pass in is just triggered on those events. The cache itself tracks nothing. You need to track all the state and values. So this: https://github.com/389ds/389-ds-base/pull/6607/files#diff-3292c7c1da4d212ee8ca277908c4b63bdd14216f92e4ea623782cd3f65fadad4R224 does nothing because between the start of the txn to the commit you never read or write from the txn, so the stats shows zero - because nothing was done during the transaction!!!! You actually need to be putting in a stats tracker https://github.com/389ds/389-ds-base/pull/6607/files#diff-3292c7c1da4d212ee8ca277908c4b63bdd14216f92e4ea623782cd3f65fadad4R257 and then using that to adjust your hit/load amounts as they change. Because the quiesce step is where the read transaction inclusions/evictions are actually triggered. PS: Your cache stats don't actually need to be a struct of Atomics, you can make it a Struct inside a CowCell and then when someone wants to view the stats you read it, when a thread wants to update you take a write on the cow cell. That way it never blocks anything. |
4cc2813
to
bd67aa8
Compare
Sure, I totally understand the last part. stats.p_weight(inner.p as u64);
stats.shared_max(shared.max as u64);
stats.freq(inner.freq.len() as u64);
stats.recent(inner.rec.len() as u64);
stats.all_seen_keys(cache.len() as u64); IIUC, it's safe to get them only on the
That was my mistake; I missed that a simple
Fixed. Please, review! |
Still not correct - those are the current value of those stats at this exact point in time. They are not lifetime stats. You need to stop thinking that these values exist and are stored and tracked but the arcache, they aren't, you need to store them. The arcache doesn't track any statistics at all, the whole point of the stats api is to let you build something that tracks the values you are interested in. For example, between the first write txn and the second, the p_weight can move up, down, or stay the same. We don't tell you which way it went. All we do is tell you "this is the p_weight at this point in time". The same with freq/rec lengths, these values move both up and down. and "all seen keys" will either go up or stay the same, but we don't tell you the difference, just the value as it exists right now. You need to calculate the difference between the previous write txn and this write txn. For example if the p_weight moves down, that indicates pressure on the recent side of the cache, which causes the frequent side to be evicted. So you can use this api to work out "oh P weight has dropped by 10% of shared max, there must be pressure". But that requires you to check the p value between the first txn and the second, then to calculate that difference and ratio to shared max between both values, and make that assessment. We don't do that for you, we just show you the values at those points in time. Does that make more sense?
No, you have to do it each write txn commit. As mentioned the values are not lifetime stats, they are the exact values at this point in time, and you need to track those events yourself, including calculating the differences as those values at those points in time change. Cache_char_stats should do nothing except return the current value of stats you have been tracking. Each time you call .commit or .finish, you need to update those stats, else your values will not be correct.
concread is a "fun one" because you have to think concurrently, about everything happening all at once. It can be tricky, but it's worth it. You can tihnk of a cowcell as a rwlock btw, where read/writes never block (unlike a rwlock). |
@Firstyear , reading your explanations my understanding is that Cache_char_stats will get sample of values. The values are valid but there may be small/significant variations between the two samples. Am I correct ? |
Description: Implement new cache statistics tracking with atomic counters and dedicated stats structs. Update concread dependency to 0.5.3 for improved cache performance. Add tests for cache statistics functionality. Fixes: 389ds#6553 Reviewed by: ?
bd67aa8
to
3e2cb43
Compare
@Firstyear Okay... Following your guidance, I've updated the code to collect stats from both Another thing.
BTW, I'm also seeing concerning memory behavior that needs investigation. With NDN cache enabled, memory usage grows significantly during exports and isn't freed afterward (unless we restart, of course). With NDN cache disabled, memory usage stays normal. ASAN shows no issues. Also, I've checked if all write and read transactions are freed on the 389 DS side, and all is good (I counted new and closed txns and the numbers where equal after the massive online export). This suggests either a concread internal issue or misuse on our part. My current thoughts:
Would appreciate any insights on the memory growth issue, but I'll keep investigating regardless... |
Collect, if you don't quiesce then new cache items will never actually be included or accessible, you need to try_quiesce after any read op. Consider reading https://github.com/kanidm/concread/blob/master/CACHE.md to understand why. The inclusion and tracking queues are bounded by the way, so they can only "buffer" so much.
Yeah, that sounds optional and better done later.
https://github.com/kanidm/concread/?tab=readme-ov-file#what-is-concurrently-readable "This is a space-time trade off, using more memory to achieve better parallel behaviour." Which is what you are seeing here. We have to keep the ghost lists to track time as well, because we need to be able to order cache events based on time too. We could investigate ways to trim the ghost lists more effectively based on epochs or similar. But yes, this is the trade of concread, you gain by having non-blocking parallel operations, but you are paying by using more ram. Realistically, the cache will grow to some X point, and then plateau because the ghost lists then settle. The amount in them shouldn't be so huge that it's a problem, and if it is then perhaps expanding the cache is worth more. This is why you could consider looking at the ghost lists as a stat to collect too. Anyway, I've opened kanidm/concread#128 which I'll work on today for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside of the changes I need to make in concread, looks good.
@droideck kanidm/concread#129 is done, will merge and release today. Should just be a version bump to 5.4 over 5.3 for you. |
@droideck 5.4 is out, give it a try :) |
Much better! The instance used only 5% or my machine's memory during the huge export. (it was nearly 19% before the update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then lets merge it :)
Description: Implement new cache statistics tracking with atomic counters
and dedicated stats structs.
Update concread dependency to 0.5.4 for improved cache performance.
Add tests for cache statistics functionality.
Fixes: #6553
Reviewed by: ?