[filesys] issues: archive bloating and plugin timeouts #3965
Comments
Oof, this one is difficult. I lean towards "the existing behavior is a bug", and I don't think option 3 is a way we want to go; Options Are Evil(tm) and I'm generally resistant to multiple plugin options around what ultimately ends up being the "same" collection.
Not a plugin option, but users (today) can use … Regardless of what we decide here, we'll want to address the tagging inconsistency.
Do I get you right that even `dumpe2fs -h` can time out? How rare can this be? We have some commands guarded by a plugopt due to the same potential-timeout reason (but I found just the one, …). I agree the current behaviour is counter-intuitive, but I don't know what the default behaviour (or plugin options) should be. Due to legacy reasons / commit history, I am inclined to keep collecting `dumpe2fs -h` by default.

I am asking my filesys colleagues for feedback to understand the users' requirements.
I'll check with the Insights and the AI teams to see if they are using that output. The way I understand the situation at the moment is: …

So point 2 could probably be: change the `dumpe2fs` PluginOpt description to state the option is about collecting extended information about the filesystem, i.e. superblock and block group descriptor detail.
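If it helps, that rewording could look roughly like the following as a `PluginOpt` definition (a sketch of the suggestion above using sos's `PluginOpt` API; the exact wording is just an example, not a final proposal):

```python
from sos.report.plugins import PluginOpt

# Sketch only: the description is the point here, not the final wording
PluginOpt('dumpe2fs', default=False,
          desc='collect extended filesystem info (superblock and '
               'block group descriptor detail)')
```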
There's a fairly long history here of tension between gathering useful data vs. bloat, runtime, and IO load concerns. Roughly a decade ago we were told that …

The history of the option goes like this:

- The … (https://bugzilla.redhat.com/show_bug.cgi?id=443037)
- It then got reverted back in commit 8d443ca, and switched back to … (this is all before my time)
- In commit 868acaf I changed …
- In commit d6272e0 the current situation of collecting … (https://bugzilla.redhat.com/show_bug.cgi?id=1105629)
- The incorrect tagging was added in 7761406, most likely … Insights at this point only knew about the default …

Collecting …
Other way around: with … Option 1 may not be viable since collecting … Option 2 is, I think, a good idea irrespective of the other decisions. The current description string is highly misleading: … That should at least be something like "dump full filesystem info" or "dump filesystem info with group descriptors".

Option 3: I tend to agree with @TurboTurtle; more options isn't the answer here. Perhaps one alternative would be to restrict the set of devices we operate on, either by an arbitrary cutoff or by limiting the collection to "system devices" (for some definition of "system devices"); a rough sketch follows below. This would complicate things, but it might be the only reasonable middle ground between support, who want this always-on, and large-system users, who are impacted by the time consumption and IO generated by running these extra commands.
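Purely as an illustration, such a restriction could take a shape like this (`MAX_DEVS`, `SYSTEM_DEVS`, `is_system_device`, and `ext_fs_devs` are invented names for the example, and the selection rule is arbitrary):

```python
# Illustrative only: cap the device list and/or filter to "system devices"
MAX_DEVS = 10
SYSTEM_DEVS = {'/dev/sda1', '/dev/sda2'}   # e.g. devices backing / and /boot

def is_system_device(dev):
    """Placeholder for whatever definition of 'system devices' we settle on."""
    return dev in SYSTEM_DEVS

candidates = [d for d in ext_fs_devs if is_system_device(d)]
for dev in sorted(candidates)[:MAX_DEVS]:  # arbitrary cutoff as a backstop
    self.add_cmd_output(f"dumpe2fs -h {dev}")
```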
@pmoravec I believe so. The output I see is usually … The commands were taking anywhere from around 20 seconds to over 300 seconds to complete per device. The 300-second case (in this instance, e2freefrag results) occurred while the CPU was 80% idle and IO wait was under 5%. When systems have been freshly rebooted, either command appears to return almost instantly, or at least quickly enough not to be any concern with the default plugin timeout value, even with around 60 storage devices to enumerate. It's possible the test (manually performed) did not include the -h, however. When I had someone else test the other week, I only had them recreate the results with …
For clarity, I observe the issues with …

EDIT: …
I'll inquire with the engineer in France whether we can access one of the platforms we recently observed the issue on, and whether we can retest with `-h` included.
Yep, poorly worded on my part; I should have said it does not collect only superblock information.
I have been investigating some issues with the `filesys` plugin and the `sos_commands/filesys` content.

Bloating

I observed this over time, generally 3-9 months of server uptime. It appears related to files in `/proc/fs/ext4/<device>/mb_groups` growing to 10MB or greater when there are 20-60 storage drives. I don't expect any solution at the sos report level; I'm just reporting it since I noticed it while tracking issues in filesys.

Timeouts
The timeouts appear to be specific to `dumpe2fs` taking extremely long times to run. I intentionally use the `dumpe2fs` output, so I have been extending the `plugin-timeout` option to work around the plugin timeouts, but I started to investigate because setting the plugin timeout to 900 or even 1500 still resulted in no filesys command output being included in the sos archive.

Based on the PluginOption definitions, it would appear that `lsof`, `dumpe2fs`, and `e2freefrag` should be optional and are set to `default=False`:

sos/report/plugins/filesys.py, lines 29 to 35 in 46f9bbe
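(The embedded snippet did not survive here; roughly, the referenced definitions look like the sketch below, paraphrased from sos's `PluginOpt` API with descriptions elided. `frag` as the name of the e2freefrag option is my assumption; see the permalink above for the real code.)

```python
from sos.report.plugins import PluginOpt

# Paraphrased sketch of the referenced option_list, not a verbatim copy
option_list = [
    PluginOpt('lsof', default=False, desc='...'),      # lsof collection
    PluginOpt('dumpe2fs', default=False, desc='...'),  # the option at issue
    PluginOpt('frag', default=False, desc='...'),      # e2freefrag collection
]
```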
Except `dumpe2fs` output exists in `sos_commands/filesys` for every collection that does not time out. I actually thought it was enabled by default until reviewing the filesys plugin more closely:

sos/report/plugins/filesys.py, lines 80 to 90 in 46f9bbe
Inside the block iterating over devices:

- `dumpe2fs` is always executed.
- The output is tagged `dumpe2fs_h` regardless of whether the `-h` option is included.
- `e2freefrag` is not executed unless its PluginOpt is set to True.

So `dumpe2fs` simply removes the `-h` option by default and does not collect superblock information.
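To make that concrete, the control flow is roughly the following (my paraphrase of the permalinked lines, not the actual source; `ext_fs_devs` and the `frag` option name are stand-ins):

```python
# Paraphrased control flow of the device loop
dumpe2fs_opts = '-h'               # one branch runs dumpe2fs with -h...
if self.get_option('dumpe2fs'):
    dumpe2fs_opts = ''             # ...the other runs it without

for dev in ext_fs_devs:            # ext_fs_devs: the enumerated ext devices
    # dumpe2fs always runs; only the -h flag depends on the option,
    # and the dumpe2fs_h tag is applied either way
    self.add_cmd_output(f"dumpe2fs {dumpe2fs_opts} {dev}", tags="dumpe2fs_h")
    if self.get_option('frag'):    # e2freefrag only runs when opted in
        self.add_cmd_output(f"e2freefrag {dev}")
```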
option by default and does not collects superblock information.Possible Solutions
1. `dumpe2fs` is disabled unless its PluginOpt is set to True, i.e. nothing is collected without `-k filesys.dumpe2fs` … (see the sketch after this list)
2. … `-k filesys.dumpe2fs` … `default=False`, & change the PluginOpt for dumpe2fs to `default=True` …
3. … `-k filesys.dumpe2fs` …
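For illustration, solution 1 might look roughly like this inside the device loop (my sketch, not a proposed patch; whether `-h` stays is part of the open question):

```python
# Hypothetical shape of solution 1: dumpe2fs becomes genuinely opt-in,
# matching its default=False definition
for dev in ext_fs_devs:
    if self.get_option('dumpe2fs'):
        # keeping or dropping -h here is a separate decision
        self.add_cmd_output(f"dumpe2fs -h {dev}", tags="dumpe2fs_h")
```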
I'm hesitant to just select what I think is the best option and open a PR without feedback from others.

@TurboTurtle @pmoravec @arif-ali @jcastill I would appreciate any feedback you can provide. I'm happy to make the commits and open the PR as long as a consensus can be reached on the correct solution to use.
… users may assume `dumpe2fs` is enabled by default and rely on it.