Tests timeout with "No free disk space" error message #422
That's similar to #420.
Found one problem: old jobs occupy 33 GB on one of the runners, because the autocleaner is crashing when trying to remove old job directories:
In cron jobs, executables should be specified with their full path:
autocleaner.py is executed by cron, which sets the path to PATH=/usr/bin:/bin. Since exportfs is in /usr/sbin, it must be specified with its full path; otherwise the script crashes and fails to clean up old job directories. Fixes freeipa#422
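A minimal sketch of the kind of change that commit message describes, assuming the cleanup script shells out to exportfs before deleting a job directory (the helper name and its arguments are illustrative, not the actual autocleaner.py code):

```python
import subprocess

# cron runs the script with PATH=/usr/bin:/bin, so calling "exportfs" by name
# raises FileNotFoundError and aborts the cleanup; the absolute path avoids that.
EXPORTFS = "/usr/sbin/exportfs"

def unexport_job_dir(job_dir, client="*"):
    """Hypothetical helper: stop exporting a job directory before removing it."""
    subprocess.run([EXPORTFS, "-u", f"{client}:{job_dir}"], check=True)
```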
I've cleaned the rest of the nightly runners manually. Let's wait for the next round of tests to check whether the number of errors goes down.
I have checked journalctl on the nightly runners - auto-cleaner is running successfully now.
Some stats:
So the situation has improved, but is still not completely fixed.
The second type of timeout is due to some kind of hang during test execution. It might be caused by nested virtualization and happens randomly. I don't think these are related to the issue described here.
I managed to reproduce the situation when
Versions:
Test suite: test_integration/test_backup_and_restore.py::TestBackupAndRestoreWithReplica
Here is the stack at this moment:
Full backtrace with argument values: https://gist.githubusercontent.com/wladich/a70a521a40bc5538d907abae9edad0e2/raw/996dbcdfaf70563240ee94f2202f066cb9011677/gistfile1.txt
The frame
The update at which the process is stuck:
Tails of log files:
/var/log/dirsrv/slapd-IPA-TEST/access
/var/log/dirsrv/slapd-IPA-TEST/errors
@flo-renaud @fcami |
This could be a DS problem; did anyone confirm whether DS was hung? If it is, we just need a stack trace from ns-slapd to confirm what the server is doing (or not doing).
I agree with @mreynolds389 that it could be a DS problem. Looking at the DS access logs, it seems that the following ADD never completed:
[08/Mar/2021:18:09:15.263940289 +0000] conn=4 op=563 ADD dn="cn=FleetCommander Desktop Profile Administrators,cn=privileges,cn=pbac,dc=ipa,dc=test"
To help DS diagnosis we would need a pstack / "top -H -p" of the ns-slapd process.
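A small sketch of how that data could be gathered, assuming gdb and the 389-ds debuginfo are installed on the affected host (the helper name and output path are illustrative):

```python
import subprocess

def dump_ns_slapd_stacks(outfile="/tmp/ns-slapd-stacks.txt"):
    """Hypothetical helper: attach gdb to the running ns-slapd and save the
    stacks of all its threads, to see where the server is hanging."""
    pid = subprocess.check_output(["pidof", "ns-slapd"], text=True).split()[0]
    backtrace = subprocess.check_output(
        ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt full"],
        text=True,
    )
    with open(outfile, "w") as f:
        f.write(backtrace)
    return outfile
```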
The hang is triggered by a cos cache rebuild that tries to acquire an RWlock in write mode:
Thread 6: rebuilds the cos cache (cos_cache_build_definition_list), trying to acquire the vattr rwlock (with write priority)
Thread 20: ADD of an entry is blocked by Thread 6, waiting for the vattr rwlock while holding DB pages
Thread 14: ADD of an entry is blocked by Thread 20, waiting on DB pages
Thread 11: is blocked because of Thread 20 and nsslapd-global-backend-lock: on
Thread 10: is blocked by Thread 6, waiting for the vattr rwlock
Thread 27: is blocked by Thread 6, waiting for the vattr rwlock
Thread 6 (Thread 0x7f43df6b1640 (LWP 32404) "ns-slapd"):
#0 futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=) at ../sysdeps/nptl/futex-internal.h:284
#1 __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x7f43f5a28240) at pthread_rwlock_common.c:830
#2 __GI___pthread_rwlock_wrlock (rwlock=0x7f43f5a28240) at pthread_rwlock_wrlock.c:27
#3 0x00007f43f7139f43 in slapi_rwlock_wrlock (rwlock=) at ldap/servers/slapd/slapi2runtime.c:305
#4 0x00007f43f713a48f in vattr_map_insert (vae=0x7f43f0d5fda0) at ldap/servers/slapd/vattr.c:2116
..
#7 0x00007f43f27becb3 in cos_dn_defs_cb (e=0x7f43f0d50380, callback_data=0x7f43df6af3e0) at ldap/servers/plugins/cos/cos_cache.c:859
#8 0x00007f43f711511e in send_ldap_search_entry_ext (pb=0x7f43f221da20, e=0x7f43f0d50380, ectrls=0x0, attrs=0x0, attrsonly=0, send_result=send_result@entry=0, nentries=0, urls=0x0) at ldap/servers/slapd/result.c:1507
...
#15 0x00007f43f71020a4 in slapi_search_internal_callback_pb (pb=pb@entry=0x7f43f221da20, callback_data=callback_data@entry=0x7f43df6af3e0, prc=prc@entry=0x0, psec=psec@entry=0x7f43f27be360 , prec=prec@entry=0x0) at ldap/servers/slapd/plugin_internal_op.c:520
..
#17 cos_cache_build_definition_list (vattr_cacheable=0x7f43f1b94f48, pDefs=0x7f43f1b94f20) at ldap/servers/plugins/cos/cos_cache.c:666
...
#20 0x00007f43f27c2195 in cos_cache_wait_on_change (arg=) at ldap/servers/plugins/cos/cos_cache.c:419
The cos cache rebuild should not be hanging on itself, because no thread in the backtrace is holding 'the_map->lock':
(gdb) print *the_map->lock
$41 = {__data = {__readers = 14, __writers = 0, __wrphase_futex = 2, __writers_futex = 1, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 2}, __size = "\016\000\000\000\000\000\000\000\002\000\000\000\001", '\000' , "\002\000\000\000\000\000\000", __align = 14}
The bug may have existed for a long time, but it was not an issue until writers were given priority. In conclusion, the most probable cause is a rdlock leak on the RWlock: some thread is forgetting to release the lock.
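To make that failure mode concrete, here is a minimal Python sketch (not DS code; the lock class and thread roles are illustrative) of how a single leaked read lock combines with writer priority to stall everything: the queued writer can never acquire the lock because the leaked reader never releases it, and every later reader queues behind the writer.

```python
import threading
import time

class WriterPriorityRWLock:
    """Toy reader/writer lock: once a writer is queued, new readers must wait."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_waiting = False
        self._writer_active = False

    def acquire_read(self, timeout=None):
        with self._cond:
            ok = self._cond.wait_for(
                lambda: not self._writer_waiting and not self._writer_active,
                timeout)
            if ok:
                self._readers += 1
            return ok

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._writer_waiting = True
            # Blocks until every reader has released the lock.
            self._cond.wait_for(lambda: self._readers == 0 and not self._writer_active)
            self._writer_waiting = False
            self._writer_active = True

lock = WriterPriorityRWLock()

def leaky_reader():
    lock.acquire_read()       # read lock taken and never released: the suspected leak
    time.sleep(3600)

def cos_rebuild():            # plays the role of Thread 6: wants the lock in write mode
    lock.acquire_write()      # never returns while the leaked rdlock is outstanding
    print("writer got the lock")

threading.Thread(target=leaky_reader, daemon=True).start()
time.sleep(0.1)
threading.Thread(target=cos_rebuild, daemon=True).start()
time.sleep(0.1)

# Plays the role of Threads 10/20/27: ordinary reads now queue behind the writer.
got = lock.acquire_read(timeout=2)
print("later reader got the lock:", got)   # False: everything is stalled
```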
I have looked at nightly test results for Fedora 32 ([testing_master_previous]) and Fedora 33 ([testing_master_latest]). If we ignore failures due to low disk space, we have:
Fedora 33
Fedora 32
And only once did it happen during `ipactl restart`: runner.log
@wladich thanks for this status. It will help to debug/verify a fix.
Issue reproduced for Fedora 32. It seems to be quite similar to the one on Fedora 33. The main difference is the last ipa-server-install message.
Versions:
Test output tail:
Stack of ipa-server-install:
Head of
ipaserver-install.log tail:
/var/log/dirsrv/slapd-IPA-TEST/access tail:
/var/log/dirsrv/slapd-IPA-TEST/errors tail:
@netoarmando Hi
Strangely, the same happened for the same test suite in another test run (959):
The problem is caused by a huge log file.
Issue: https://pagure.io/freeipa/issue/8877 |
@wladich I cleaned all runners to delete those long logs; 4 runners were storing 16 GB+ inside the jobs directory.
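A quick way to spot logs like these before they fill a runner again; a minimal sketch, assuming the jobs live under a path such as /var/lib/ipa-runner/jobs (the directory and the size threshold are assumptions, not the real runner layout):

```python
import os

def find_large_files(root="/var/lib/ipa-runner/jobs", min_size=500 * 1024 * 1024):
    """List files under root that are at least min_size bytes, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= min_size:
                hits.append((size, path))
    return sorted(hits, reverse=True)

for size, path in find_large_files():
    print(f"{size / 2**30:.1f} GiB  {path}")
```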
Thank you! |
All issues related to disk space are now resolved; closing the issue.
Timeout can occur
In both cases the current process appears to be stuck.