Add epoll support for issue 5181 #5182
base: main
Conversation
Great to see this back on the table! We did some initial testing with the old patch (which you mentioned had some bugs in it, incorrect use of sizeof, etc), and it was actually performing worse than our existing connection code. So we look forward to testing this latest patch. @tbordaz did that initial testing, and perhaps he can share with you the tooling we used so you can play around with it as well.
Hi, can someone link this to issue 5181? I can't seem to figure out how to do that. Thanks. --Larry
Sorry for the delay in getting the PR back online, but all the paperwork and approvals are done. I look forward to any feedback from you all and can hopefully help address any issues that are discovered. If epoll isn't providing any performance or architectural benefit, that is ok; at least the option was explored.
Moving to epoll is the right thing to do; it should technically be able to service things faster - it just might require more code changes :-)
Mostly looks really good. A few small details, and one larger change in here, but it looks really good. I think the fact that it's behind an ifdef is excellent, because that means we should consider merging it sooner rather than later so that we can start to test this more extensively, and we have a fast rollback path if something goes sideways. Great work!
/* we try to acquire the connection mutex, if it is already
 * acquired by another thread, don't wait
 */
if (pthread_mutex_trylock(&(c->c_mutex)) == EBUSY) {
Could consider doing rc = trylock, then checking rc != 0, and if it's a weird code, logging that (weird being anything other than EBUSY, see man 3 pthread_mutex_trylock).
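For illustration, a minimal sketch of that pattern (the surrounding connection-scanning loop and the "slapd_daemon" subsystem string are assumed for context, not taken from the PR; it also assumes <pthread.h> and <errno.h> are already included, as in daemon.c):

/* Inside the loop over connections; c is the connection being examined. */
int rc = pthread_mutex_trylock(&(c->c_mutex));
if (rc != 0) {
    if (rc != EBUSY) {
        /* Anything other than EBUSY (e.g. EINVAL) is unexpected; log it. */
        slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon",
                      "pthread_mutex_trylock failed with error %d\n", rc);
    }
    /* The mutex is held by another thread (or an error occurred);
     * skip this connection rather than blocking. */
    continue;
}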
Updated per suggestion; please review the new update.
@@ -4792,6 +4822,25 @@ config_set_maxdescriptors(const char *attrname, char *value, char *errorbuf, int
    }
}

#ifdef LINUX
#ifdef ENABLE_EPOLL
    maxVal = (int64_t)slapdFrontendConfig->maxuserwatches;
I don't think we should add an extra setting here. maxuserwatches is always going to have to be at least equal to or greater than the conntable, meaning that we already have a setting for this as a limit. So I'd say strip out all the maxuserwatches stuff and check that the conntable size is <= fs.maxuserwatches, and that's it. Then if conntable > fs.maxuserwatches, clamp conntable to fs.maxuserwatches.
In general "adding more tunables" is not a good use of time/effort since most tunables are never altered or touched.
I agree with the assessment about creating excess tunables. I didn't actually create a new tunable. max_user_watches is stored in slapdFrontendConfig so that it can be accessed from multiple locations. At all of the locations where related tunables are configured, they are clamped to the smaller of the specific tunable or max_user_watches. This seemed the best solution given the number of tunables that clamp the connection table size.
I was concerned that without adding the other clamps it might be confusing for administrators to be denied a tunable change with little or no obvious explanation.
If it is preferable, I could remove all clamps except for the one in config_set_conntablesize.
IMHO those checks look valid to me. Before setting descriptors, conntable-sz, or reserveddesc, it checks that the set values are aligned with max_user_watches; that looks safe and verbose enough to help the admin.
Okay, in that case please remove the tunable for maxuserwatches, and just check its value against the conntable size :)
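For illustration only, a minimal sketch of that check, assuming the kernel limit is read directly from procfs when the conntable size is configured (the helper name, the variable names, and the exact call site are assumptions, not the PR's code):

#include <stdio.h>

/* Illustrative helper (not in the codebase): read the kernel's epoll watch
 * limit from /proc/sys/fs/epoll/max_user_watches. Returns -1 on failure. */
static long long
get_max_user_watches(void)
{
    long long maxwatches = -1;
    FILE *fp = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    if (fp != NULL) {
        if (fscanf(fp, "%lld", &maxwatches) != 1) {
            maxwatches = -1;
        }
        fclose(fp);
    }
    return maxwatches;
}

/* Hypothetical use inside config_set_conntablesize(): clamp the requested
 * size to the kernel limit instead of exposing a separate tunable. */
long long maxwatches = get_max_user_watches();
if (maxwatches > 0 && conntablesize > maxwatches) {
    conntablesize = maxwatches;
}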
The Patch looks good to me.. and really nice. I did some performance tests but unexpectedly saw no gain and even some loss (vs 2.0).

**With #5181 patch** ./perf_test.sh 55 50 (listener 99% CPU) (20% slower than regular)

      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  113174.013        0.483        1.000        0.000   113173.981        0.483
  118207.979        0.464        1.000        0.000   115690.977        0.473
  114818.793        0.477        1.000        0.000   115400.277        0.474
  114730.341        0.478        1.000        0.000   115232.784        0.475

**389-ds-2.0** ./perf_test.sh 55 50 (accept_thread => 0% (new cnx), slapd_daemon => 99.2% (new op))

      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  143122.319        0.379        1.000        0.000   143122.281        0.379
  147573.311        0.371        1.000        0.000   145347.574        0.375
  146279.148        0.375        1.000        0.000   145658.129        0.375

So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener it is equivalent.

Measure of 10K connections:

**epoll.5181**
10000 connections created
2 search on 10000 connections, complete.
Unbind complete for 10000 connections

real 1m43.165s
user 0m3.541s
sys  0m2.804s

**389-ds2.0**
10000 connections created
2 search on 10000 connections, complete.
Unbind complete for 10000 connections

accept_thread 9392 = 4%    slapd_daemon 9351 = 0%
accept_thread 9392 = 0%    slapd_daemon 9351 = 99.3%

real 1m3.962s
user 0m3.384s
sys  0m2.438s

So in terms of 10K established connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.
@tbordaz thank you for quantifying the performance of the epoll patch. I'm obviously very disappointed with the result. I don't see any obvious issue with the implementation but I will review the code again. If you, or anyone else, has any suggestion as to the failure, I would greatly appreciate your feedback. I suspect the issue may be with epoll_pr_idle_pds and connections being transitioned to closure, but this is nothing more than a suspicion.
I have no idea what could be related to that loss. I see epoll_pr_idle_pds as not very different from setup_pr_read_pds, going through the active connections. My initial thought was that epoll was somehow transferring CPU from userspace (scanning the poll array) to kernelspace (scanning the poll array to prepare the read_ready array), so I was initially expecting similar overall perf for epoll vs poll. It is also disappointing to see a loss. It would require more tests to identify where time is spent.

The overall patch is (very) good to me. I do not know how to progress from that. Should we merge the patch with the ifdef, then eventually replace the ifdef with a config toggle? Another option is to split the patch between accept_thread and the listener. My tests were showing (TBC) equivalent perf for accept_thread epoll vs poll, so we could merge accept_thread with epoll and #ifdef the listener part.
Just curious: @tbordaz, what kind of connections is the performance tool using? (short-term? or persistent?) That should help to understand which part of the code you are stressing (the accept() handling or standard operations?)
@progier389 here are the descriptions of the two tests.

Search throughput:
N client threads open a single connection each, bind, and send an infinite number of synchronous searches.

[01/Mar/2022:10:13:05.032462875 -0500] conn=45 fd=106 slot=106 connection from client to server
[01/Mar/2022:10:13:05.032717061 -0500] conn=45 op=0 BIND dn="cn=Directory manager" method=128 version=3
[01/Mar/2022:10:13:05.032777631 -0500] conn=45 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000033940 optime=0.000064496 etime=0.000097586 dn="cn=directory manager"
[01/Mar/2022:10:13:05.034030702 -0500] conn=46 fd=107 slot=107 connection from client to server
[01/Mar/2022:10:13:05.034186763 -0500] conn=46 op=0 BIND dn="cn=Directory manager" method=128 version=3
[01/Mar/2022:10:13:05.034367428 -0500] conn=46 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000028356 optime=0.000172710 etime=0.000198692 dn="cn=directory manager"
...
[01/Mar/2022:10:13:06.266385380 -0500] conn=45 op=1798 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06616)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266399608 -0500] conn=46 op=2415 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06620)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266435219 -0500] conn=45 op=1798 RESULT err=0 tag=101 nentries=1 wtime=0.000169448 optime=0.000049963 etime=0.000218492
[01/Mar/2022:10:13:06.266442879 -0500] conn=46 op=2415 RESULT err=0 tag=101 nentries=1 wtime=0.000148119 optime=0.000043390 etime=0.000190653
[01/Mar/2022:10:13:06.266723349 -0500] conn=46 op=2416 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06665)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266748568 -0500] conn=45 op=1799 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06666)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266811203 -0500] conn=45 op=1799 RESULT err=0 tag=101 nentries=1 wtime=0.000194276 optime=0.000062814 etime=0.000256238
[01/Mar/2022:10:13:06.266834464 -0500] conn=46 op=2416 RESULT err=0 tag=101 nentries=1 wtime=0.000170698 optime=0.000111508 etime=0.000280912
...
[01/Mar/2022:10:13:17.188528409 -0500] conn=45 op=24167 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user04034)" attrs="distinguishedName"
[01/Mar/2022:10:13:17.279252865 -0500] conn=46 op=25531 RESULT err=0 tag=101 nentries=1 wtime=0.000135211 optime=0.100236121 etime=0.100366851
[01/Mar/2022:10:13:17.288799914 -0500] conn=45 op=24167 RESULT err=0 tag=101 nentries=1 wtime=0.000122219 optime=0.100267877 etime=0.100385760
[01/Mar/2022:10:13:17.383157297 -0500] conn=46 op=-1 fd=107 closed error - B1
[01/Mar/2022:10:13:17.383215892 -0500] conn=45 op=-1 fd=106 closed error - B1

10K use case:
A single client opens 10K connections, binds, and sends a synchronous search on each connection.

[01/Mar/2022:10:26:18.954343720 -0500] conn=6270 fd=6276 slot=6276 connection from client to server
[01/Mar/2022:10:26:18.955272018 -0500] conn=6271 fd=6277 slot=6277 connection from client to server
[01/Mar/2022:10:26:18.956261209 -0500] conn=6271 op=0 BIND dn="cn=directory manager" method=128 version=3
[01/Mar/2022:10:26:18.956261361 -0500] conn=6270 op=0 BIND dn="cn=directory manager" method=128 version=3
[01/Mar/2022:10:26:18.956314499 -0500] conn=6271 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000009142 optime=0.000056602 etime=0.000065016 dn="cn=directory manager"
[01/Mar/2022:10:26:18.956325437 -0500] conn=6270 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000012630 optime=0.000067354 etime=0.000079061 dn="cn=directory manager"
..
[01/Mar/2022:10:26:51.291533809 -0500] conn=6271 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=scarter)" attrs=ALL
[01/Mar/2022:10:26:51.291585931 -0500] conn=6271 op=1 RESULT err=0 tag=101 nentries=0 wtime=0.000207695 optime=0.000053214 etime=0.000259783
[01/Mar/2022:10:26:51.296058736 -0500] conn=6272 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=scarter)" attrs=ALL
[01/Mar/2022:10:26:51.296110817 -0500] conn=6272 op=1 RESULT err=0 tag=101 nentries=0 wtime=0.000204513 optime=0.000053246 etime=0.000256633
..
[01/Mar/2022:10:27:54.795958317 -0500] conn=6270 op=3 UNBIND
[01/Mar/2022:10:27:54.795964540 -0500] conn=6270 op=3 fd=6276 closed error - U1
[01/Mar/2022:10:27:54.796241194 -0500] conn=6271 op=3 UNBIND
[01/Mar/2022:10:27:54.796247432 -0500] conn=6271 op=3 fd=6277 closed error - U1
@tbordaz could you point me at the tooling or scripts so I might be able to reproduce it at my location? I'm poking around at the code but it's difficult for me to measure my results using slamd and replaying production logs.
So if I understand rightly, that is: It is a bit surprising that epoll gets slower than poll in that case. That said, something puzzles me in this fix:
Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

We are still treating connections here with epoll as though they are fds that need to be selected and set up into the table. epoll is a set of system calls, which means they are thread safe. We only need to change the epollfd table when a connection is added or removed, and that does not necessitate a full rebuild of the table (via epoll_ctl DEL).

So the way to resolve this is that epoll_pr_idle_pds probably doesn't need to exist as a function. We also don't need an active/inactive list either. When we set this flag, at the callsite we should be calling the epoll disarm. At the locations where we change c_gettingber and the thread number, we call epoll arm. Part of the value of epoll is that we don't need a single thread to manage the arming and disarming; we just need the connections to signal their state during the handling of that connection (from within the worker or reader thread). This is probably the "biggest" change that will improve our performance: we aren't bottlenecked on waiting for this thread to cycle around and get our fds ready.
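To make that suggestion concrete, here is a minimal sketch (not the PR's code) of per-connection arm/disarm through epoll_ctl(); the helper names are invented, and it assumes the connection table exposes a shared epoll fd while each connection exposes its socket fd:

#include <sys/epoll.h>
#include <string.h>
#include <errno.h>

/* Re-arm a connection for readiness notification. epoll_ctl() is a system
 * call, so a worker or reader thread can call this for its own connection
 * without a central polling thread rebuilding any table. */
static int
conn_epoll_arm(int epollfd, int fd, void *conn)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLONESHOT; /* one-shot: delivered once, then auto-disarmed */
    ev.data.ptr = conn;                 /* epoll_wait() hands the connection back */
    if (epoll_ctl(epollfd, EPOLL_CTL_MOD, fd, &ev) == 0) {
        return 0;
    }
    if (errno == ENOENT) {
        /* Not registered yet (first arm after accept): add it. */
        return epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
    }
    return -1;
}

/* Stop watching the fd, e.g. while a worker owns the connection or on close. */
static int
conn_epoll_disarm(int epollfd, int fd)
{
    return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
}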
It should still be enforced from what I can read?
@Firstyear, can you link this again or give a line number? For some reason all of the links are just taking me to the top of the diff. I want to make sure I am connecting the correct dots. Especially the following link:
That being said, I will whittle down epoll_pr_idle_pds and redistribute the arm/disarm logic. I had done this previously, but since I appeared to be losing track of connections I backed out of that path. I discovered later the connection losses were unrelated. I believe epoll_pr_idle_pds can be reduced to managing the idle timeouts, which could be handled on a longer interval. I may need some pointers on relocating the c_max_threads_per_conn control and persistent search control logic.
Once you click "load diff" on daemon.c it will take you to the correct line.
Yep, that's a good idea. I'd start with one use case at a time, maybe just the "arm" path to rearm a connection for now, because that will yield the first and most obvious improvement to performance for this change, and you don't run any risk of "losing" connections during a conn close.
Yep, that's completely okay, I think you're correct here. EDIT: IMO once we have re-arming completed in the worker thread, and performance is equal to or better, I'd suggest we merge this and then work on reducing pr_idle_pds over time, rather than doing it all at once.
Sorry, my description was confusing. We have two tools and two runs. The first tool is to monitor search throughput (searchrate). The second tool is to monitor how fast we establish 10K connections and then send a few (2) searches on each of the 10K connections.
The first tool:
The second tool:
Sorry about the noise regarding the active table (I was tricked by my memory: I wrongly thought the active table held the polled connections, while it actually holds the open connections).
ldap/servers/slapd/daemon.c
num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events, connection_table_size, slapd_wakeup_timer);
// XXX This makes absolutely no sense, if this following slapi_log_err statement is removed or silenced |
While cleaning up my code for further discussion, I discovered that removing this slapi_log_err or silencing it causes an order-of-magnitude drop in performance, from ~4k/s to ~0.6k/s in my testing. I'm completely befuddled by this; can someone point me in the right direction?
C is not a "linear" language. The code here isn't all executing one line at a time: your compiler is free to re-arrange things provided there are no data dependency conflicts, and your CPU will pipeline and run multiple instructions simultaneously, in any order it wishes.
One rule in a compiler is about code hoisting/sinking, i.e. where statements are allowed to move. An if condition is a point in the code where the compiler may NOT re-arrange code, so certain statements create ordering constraints. If you remove the slapi_log_err, then you create a scenario where the if condition is dead-code eliminated, meaning that other parts CAN be re-arranged. Without looking at the asm that's emitted and how it differs between the two builds, it's hard to know "why" this is causing such a huge performance drop.
I suspect if you replaced the if condition with:
__atomic_thread_fence(__ATOMIC_ACQ_REL)
This should cause the code on either side of this fence to remain in those locations. But that may also have its own issues. I've been trying to find a way to prevent code hoisting/sinking without atomics, but I'm not sure it's possible.
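For placement, a rough sketch against the loop quoted above (not a tested fix; the surrounding names are taken from the quoted code, and the dispatch body is elided):

while (!g_get_shutdown()) {
    /* Compiler/CPU ordering barrier, roughly where the slapi_log_err() call
     * currently sits; code may not be hoisted or sunk across this point. */
    __atomic_thread_fence(__ATOMIC_ACQ_REL);

    num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events,
                          connection_table_size, slapd_wakeup_timer);
    /* ... dispatch num_poll ready events ... */
}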
This is just one of those lovely examples of why C is not a good language in any way.
Anyway, another option, instead of the switch statement, could be to use an if like this:
if (num_poll > 0) {
    /* dispatch the ready events */
} else if (num_poll < 0) {
    /* handle the epoll_wait error */
}
That may also end up working more nicely than the case switch, especially WRT optimisation, since the "common" case is the first one. You can also then move the if conn_count != last_count inside that loop too; a fuller sketch follows below.
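A minimal sketch of that loop shape, reusing the names from the quoted epoll_wait() call; the handling bodies are placeholders rather than the PR's logic, and it assumes <errno.h> and the usual slapi headers are in scope:

while (!g_get_shutdown()) {
    num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events,
                          connection_table_size, slapd_wakeup_timer);
    if (num_poll > 0) {
        /* Common case first: dispatch each ready fd to its handler. */
        for (int i = 0; i < num_poll; i++) {
            /* handle ct_epoll_events[i].data ... */
        }
        /* The conn_count != last_count bookkeeping could live here as well. */
    } else if (num_poll < 0) {
        /* Error case: EINTR is benign, anything else is worth logging. */
        if (errno != EINTR) {
            slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon",
                          "epoll_wait failed, errno %d\n", errno);
        }
    }
    /* num_poll == 0: timeout expired with nothing ready; just loop again. */
}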
Anyway, a few ideas for you :)
@Firstyear I understand your comments about compiler optimizations, pipelining, and the out-of-order execution that can result; however, I cannot devise any scenario where reordering the code would result in this performance loss without violating a data dependency.
However, I have rewritten the suspect code segment in various ways, changing the semantics between if-then/case, nesting/de-nesting operations, etc. I can find no solution that preserves the performance without the call to slapi_log_err. The location of the call to slapi_log_err and the message itself are irrelevant: at any point within the while(!g_get_shutdown) loop, emitting any message via slapi_log_err corrects the performance regression.
This brings me to the conclusion that this must be a thread yield or thread scheduling problem that is resolved by the action of slapi_log_err local to the thread, which makes me think the thread local I/O is waking/scheduling the thread. Based on that I am beginning to suspect that moving my code away from the signalpipe semantic was not a correct decision. I am going to pull the signalpipe semantic back into the epoll code to confirm this suspicion.
Adding signalpipe to the connection table polling did nothing to improve the regression.
I'm at a complete loss.
If you suspect it's a thread yield issue, can you try pthread_yield() ?
@llile Hey there, I've started to look into this a bit. Looking at my decompiled outputs, I don't see any difference in the results besides the added log output - no weird co-arrangement etc. Even the same variables and offsets.
Can you provide me:
- What compiler version you are using
- What build settings you are using
- a copy of build/.libs/ns-slapd (should be the ELF file) from the "with" and "without" log message variants so that I can decompile and help investigate?
- A copy of libslapd.so (should also be in .libs) from both builds. Should also be an ELF
EDIT: Saying this, it's quite interesting that clang appears to be emitting a much better connection loop than gcc: gcc treats the g_get_shutdown() check as "likely true", so all the shutdown components precede the epoll wait, whereas clang assumes it's "likely false", so it places the shutdown handling at the end, as per the C files. We might need to change this too:
while (!g_get_shutdown()) {
TO
while ( __builtin_expect( !g_get_shutdown(), !0)) {
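For illustration, the same hint could be wrapped in a small macro so the loop stays readable (the LIKELY name is ours, not the codebase's; __builtin_expect is available in both gcc and clang):

/* Hypothetical convenience macro; falls back to a plain expression elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

/* The daemon loop then keeps its shape while still hinting the branch: */
while (LIKELY(!g_get_shutdown())) {
    /* ... epoll_wait() and dispatch ... */
}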
Hi @Firstyear, sorry for the delayed response.
The compiler is gcc-11.2.1-7.fc35.x86_64 and the configure settings are $ ./configure --exec-prefix=/usr --prefix=/ --datarootdir=/usr/share --datadir=/usr/share --libdir=/usr/lib64 --disable-asan --disable-tsan --enable-cmocka --with-systemd --with-selinux --with-openldap
What is the best method to provide you with the binaries?
Have you made any discoveries?
I think that Google Drive or some other drop box would work. Alternatively, you can tar them into a PR or similar on GitHub. Finally, if they compress to smaller than 10 MB you can email them to me at william.brown at suse.com :)
I did discover a bit in the reversing process, but nothing obvious that would cause the delay. That's why I'm curious to see your build outputs. I'll also try to build these myself today and compare once you send me your binaries.
I've had an initial look. My test results are with: /opt/dirsrv/bin/ldclt -h localhost -p 3389 -N 10 -n 4 -e bindeach -e bindonly. I avoided search paths so that this would stress only anonymous (fully in-memory) binds and connections.
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_out log line: ldclt[22341]: Global average rate: 8829269.00/thr (353170.75/sec), total: 35317075
Clang w_ log line: ldclt[840]: Global average rate: 9532904.00/thr (381316.16/sec), total: 38131614
Clang w_out log line: ldclt[27872]: Global average rate: 9507196.00/thr (380287.84/sec), total: 38028785
The two clang results are within ±0.2%, which means I perceive this to be noise. The GCC result however is ±3%, where the presence of the log line increases performance by 3%. This is still a trivial amount IMO, but there is certainly more weight to the fact that there is a code generation issue with GCC related to the presence or absence of this log line, one that does not affect clang.
I'm not seeing the 20x kind of number that you are reporting though, so I'm wondering if there is something else going on in your test methodology, as my results show it as much closer. Regardless, there is a perceptible difference with the log line with GCC.
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_out log line: ldclt[22341]: Global average rate: 8829269.00/thr (353170.75/sec), total: 35317075
GCC w trace line: Global average rate: 8508577.00/thr (340343.09/sec), total: 34034307
Now if we change the log level to TRACE (so it emits nothing, but is still present), the performance decreases. This means that it is unlikely to be a code placement/alignment issue; rather, some behaviour in ldap/servers/slapd/log.c slapi_log_error, after the loglevel check, is influencing performance, likely affecting cache or memory behaviour. Reading that function we can see there are two likely causes: either the call to acquire the lock, or the call to issue the printf to the console. I attempted removing each of them in separate builds.
Removing either caused the performance to rise to where we expect, which makes me think it is the issuance of the syscall for either the printf or the lock that is the cause of the performance rise.
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_out printf: ldclt[3793]: Global average rate: 9260484.00/thr (370419.38/sec), total: 37041935
GCC w_out lock: ldclt[10884]: Global average rate: 9313178.00/thr (372527.12/sec), total: 37252714
But if we remove the call to slapd_log_error_proc_internal then the performance increases to the rates we saw without the lock.
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_out log_err ldclt[25077]: Global average rate: 9211057.00/thr (368442.28/sec), total: 36844228
This starts to make me think that there is a data dependence or code re-arrangement that's going on by having the call to slapi_log_error present, which impacts the behaviour of the branch predictor potentially. Anyway, so I finally tried putting in the expect and removing the log line:
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_ log + expect: ldclt[32169]: Global average rate: 9226223.00/thr (369048.91/sec), total: 36904892
GCC w_out log + expect: ldclt[6841]: Global average rate: 9100628.00/thr (364025.12/sec), total: 36402510
Adding the builtin expect line as suggested gets us up to the performance with the log line, but the log line is still increasing our performance.
Next I tried simply adding a printf(""); instead of the log line.
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_ log + expect: ldclt[32169]: Global average rate: 9226223.00/thr (369048.91/sec), total: 36904892
GCC w_ printf + expect: ldclt[13949]: Global average rate: 9437884.00/thr (377515.38/sec), total: 37751534
This is starting to get up near clang performance! So we know that it's not the log itself, but likely the act of having the syscall that is helping somehow. That's probably also why the interaction with epoll_pr_idle exists.
Trying with sched yield instead of printf/slapi_log_err():
GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr (364218.12/sec), total: 36421811
GCC w_ printf + expect: ldclt[13949]: Global average rate: 9437884.00/thr (377515.38/sec), total: 37751534
GCC w_ yield + expect: ldclt[28141]: Global average rate: 9817390.00/thr (392695.59/sec), total: 39269562
There we go! Now we are seeing some gains! The syscall to printf or lock() was likely allowing a yield point for the scheduler, so instead we can manually add one - and it's much cleaner :)
Now curiously, if we use clang with yield + expect it actually lowers performance. Again, likely just related to clang emitting better code.
So ultimately the changes needed to get GCC to clang's performance are:
diff --git a/ldap/servers/slapd/daemon.c b/ldap/servers/slapd/daemon.c
index 6f5d1f6ca..eb9f6f471 100644
--- a/ldap/servers/slapd/daemon.c
+++ b/ldap/servers/slapd/daemon.c
@@ -55,6 +55,7 @@
#include <private/pprio.h>
#include <ssl.h>
#include "fe.h"
+#include <sched.h>
#if defined(LDAP_IOCP)
#define SLAPD_WAKEUP_TIMER 250
@@ -1156,12 +1157,22 @@ slapd_daemon(daemon_ports_t *ports)
#endif
/* The meat of the operation is in a loop on a call to select */
- while (!g_get_shutdown()) {
+ while (
+#ifdef __clang__
+ !g_get_shutdown()
+#else
+ __builtin_expect(!g_get_shutdown(), !0)
+#endif
+ ) {
#ifdef ENABLE_EPOLL
// XXX emitting any message between here and #else, provides 20x performance improvement
// XXX except as noted below near epoll_pr_idle
- slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon", "epoll mystery\n");
+#ifndef __clang__
+ sched_yield();
+#endif
// XXX There appears to be a slight performance improvment with if/else vs. case as firstyear predicted
if ((num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events, connection_table_size, slapd_wakeup_timer)) > 0) {
I have tidied up the code, noting the performance anomaly with XXX markers. I do not comprehend this issue. @tbordaz could you possibly analyze this version versus 2.x?
@llile, I am really sorry for this late feedback. I will rerun the tests on the latest version and describe the details of the test and what I am observing. My test environment got cleared and I had to recreate it.
The performance (throughput and response time) of the recent patch is similar to the first patch: with epoll, throughput drops by 40% and response time increases by 60%. Using the Ping searchrate tool.

389-ds-base-2.0
ldapbench/bin/searchrate --hostname server --port 389 --bindDN "cn=Directory manager" --bindPassword password --baseDN dc=example,dc=com --scope sub --filter "(uid=user[0:10000%00000])" --attribute dn --numThreads 50 -I 5

      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  150977.257        0.329        1.000        0.000   150977.218        0.329
  154457.337        0.323        1.000        0.000   152717.236        0.326
  153951.556        0.324        1.000        0.000   153128.693        0.325
  156661.714        0.318        1.000        0.000   154011.973        0.323
  158359.078        0.315        1.000        0.000   154881.280        0.322

389-ds-base-2.0 + epoll
ldapbench/bin/searchrate --hostname server --port 389 --bindDN "cn=Directory manager" --bindPassword password --baseDN dc=example,dc=com --scope sub --filter "(uid=user[0:10000%00000])" --attribute dn --numThreads 50 -I 5

      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
   97315.923        0.512        1.000        0.000    97315.899        0.512
   97074.039        0.513        1.000        0.000    97194.966        0.513
   93545.599        0.534        1.000        0.000    95978.656        0.519
   94132.058        0.530        1.000        0.000    95516.981        0.522
   89023.958        0.560        1.000        0.000    94218.316        0.529

I installed a standalone instance with a very large dbcache and entry cache, so all DB entries and indexes are cached. I will attach the database ldif file.
This PR is likely impacted by multi-listener, the removal of turbo mode, and dynamic thread management. The code has changed, but a new set of tests is also available.
100% agree, I think epoll should be looked at next and then the remaining bottlenecks found and cleaned up :)
Pull request for issue 5181
Relates: #5181