
Add epoll support for issue 5181 #5182

Status: Open. llile wants to merge 1 commit into base: main.

Conversation

@llile commented Feb 25, 2022

Pull request for issue 5181

Relates: #5181

@mreynolds389 (Contributor)

Great to see this back on the table! We did some initial testing with the old patch (which you mentioned had some bugs in it, incorrect use of sizeof, etc), and it was actually performing worse than our existing connection code. So we look forward to testing this latest patch. @tbordaz did that initial testing, and perhaps he can share with you the tooling we used so you can play around with it as well.

@llile (Author) commented Feb 25, 2022

Hi,

Can someone link this to issue 5181? I can't seem to figure out how to do that.

Thanks.

--Larry

@llile (Author) commented Feb 25, 2022

Sorry for the delay in getting the PR back online, but all the paperwork and approvals are done.

I look forward to any feedback from you all and can hopefully help address any issues that are discovered. If epoll isn't providing any performance or architectural benefit, that is OK; at least the option was explored.


@mreynolds389 (Contributor)

> Sorry for the delay in getting the PR back online, but all the paperwork and approvals are done.
>
> I look forward to any feedback from you all and can hopefully help address any issues that are discovered. If epoll isn't providing any performance or architectural benefit, that is OK; at least the option was explored.

Moving to epoll is the right thing to do; it should technically be able to service things faster - it just might require more code changes :-)

@Firstyear (Contributor) left a comment

Mostly looks really good. A few small details, and one larger change in here, but it looks really good. I think the fact that it's behind an ifdef is excellent, because that means we should consider merging it sooner rather than later so that we can start to test this more extensively, and we have a fast rollback path if something goes sideways. Great work!

ldap/servers/slapd/daemon.c: four review threads (resolved; one outdated)
/* we try to acquire the connection mutex, if it is already
* acquired by another thread, don't wait
*/
if (pthread_mutex_trylock(&(c->c_mutex)) == EBUSY) {
Contributor:

Could consider doing rc = trylock, then checking rc != 0, and if it's an unexpected code, logging it (unexpected meaning anything other than EBUSY; see man 3 pthread_mutex_trylock).
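
A minimal sketch of that pattern, for illustration only (the helper name is hypothetical and the subsystem string is illustrative; it assumes the caller skips the connection when the lock is not acquired):

/* Hypothetical helper sketching the suggestion above: try the connection
 * mutex without blocking, and log any return code that is neither 0
 * (acquired) nor EBUSY (held by another thread). */
static int
try_lock_connection(Connection *c)
{
    int rc = pthread_mutex_trylock(&(c->c_mutex));
    if (rc != 0 && rc != EBUSY) {
        /* Unexpected code (e.g. EINVAL/EAGAIN); see man 3 pthread_mutex_trylock. */
        slapi_log_err(SLAPI_LOG_ERR, "try_lock_connection",
                      "pthread_mutex_trylock returned unexpected code %d\n", rc);
    }
    return rc; /* 0 means the caller now holds c->c_mutex and must unlock it */
}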

Author:

Updated per the suggestion; please review the new update.

ldap/servers/slapd/daemon.c (resolved review thread)
@@ -4792,6 +4822,25 @@ config_set_maxdescriptors(const char *attrname, char *value, char *errorbuf, int
}
}

#ifdef LINUX
#ifdef ENABLE_EPOLL
maxVal = (int64_t)slapdFrontendConfig->maxuserwatches;
Contributor:

I don't think we should add an extra setting here. maxuserwatches is always going to have to be at least equal to or greater than the conntable size, meaning that we already have a setting for this as a limit. So I'd say strip out all the maxuserwatches stuff and check that the conntable size is <= fs.maxuserwatches, and that's it. Then, if conntable > fs.maxuserwatches, clamp conntable to fs.maxuserwatches.

In general "adding more tunables" is not a good use of time/effort since most tunables are never altered or touched.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the assessment about creating excess tunables. I didn't actually create a new tunable. max_user_watches is stored in slapdFrontendConfig so that it can be accessed from multiple locations. At all of the locations where related tunables are configured, they are clamped to the smaller of the specific tunable or max_user_watches. This seemed the best solution given the number of tunables that clamp the connection table size.

I was concerned that without adding the other clamps it might be confusing for administrators to be denied a tunable change with little or no obvious explanation.

If it is preferable, I could remove all clamps except for the one in config_set_conntablesize.
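
For illustration, a minimal sketch of the kind of clamp being discussed, assuming the limit is read from /proc/sys/fs/epoll/max_user_watches; the helper name and log wording are hypothetical, not part of the patch:

#include <stdio.h>
#include <inttypes.h>

/* Hypothetical helper: clamp a requested size to the kernel's epoll watch
 * limit and log why, so the administrator sees the reason for the clamp. */
static int64_t
clamp_to_max_user_watches(const char *attrname, int64_t requested)
{
    int64_t max_watches = 0;
    FILE *fp = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    if (fp) {
        if (fscanf(fp, "%" SCNd64, &max_watches) != 1) {
            max_watches = 0;
        }
        fclose(fp);
    }
    if (max_watches > 0 && requested > max_watches) {
        slapi_log_err(SLAPI_LOG_ERR, "clamp_to_max_user_watches",
                      "%s (%" PRId64 ") exceeds fs.epoll.max_user_watches (%" PRId64 "), clamping\n",
                      attrname, requested, max_watches);
        return max_watches;
    }
    return requested;
}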

Contributor:

IMHO those checks look valid to me. Before setting descriptors, conntable-size, or reserved descriptors, it checks that the set values are aligned with max_user_watches; that looks safe and verbose enough to help the admin.

Contributor:

Okay, in that case please remove the tunable for maxuserwatches, and just check its value against the conntable size :)

@tbordaz (Contributor) commented Feb 28, 2022

The patch looks good to me... and really nice.

I did some performance tests but unexpectedly saw no gain, and even some loss (vs. 2.0).
Measure of throughput (subtree search with an indexed search returning 1 entry):

**With #5181 patch**
./perf_test.sh 55 50 (listener 99% CPU)
(20% slower than regular)
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  113174.013        0.483        1.000        0.000   113173.981        0.483
  118207.979        0.464        1.000        0.000   115690.977        0.473
  114818.793        0.477        1.000        0.000   115400.277        0.474
  114730.341        0.478        1.000        0.000   115232.784        0.475

**389-ds-2.0**
./perf_test.sh 55 50
        accept_thread => 0%   (new cnx)
        slapd_daemon  => 99.2%  (new op)
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  143122.319        0.379        1.000        0.000   143122.281        0.379
  147573.311        0.371        1.000        0.000   145347.574        0.375
  146279.148        0.375        1.000        0.000   145658.129        0.375

So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener, it is equivalent.

Measure of 10K conn

**epoll.5181**
10000 connections created
2 search on 10000 connections, complete.
Unbind complete for 10000 connections

real    1m43.165s
user    0m3.541s
sys     0m2.804s

**389-ds2.0**
10000 connections created
2 search on 10000 connections, complete.
Unbind complete for 10000 connections
        accept_thread 9392 = 4%
        slapd_daemon 9351 =  0%

        accept_thread 9392 = 0%
        slapd_daemon 9351 = 99.3%

real    1m3.962s
user    0m3.384s
sys     0m2.438s

So in terms of establishing 10K connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.

@llile (Author) commented Mar 1, 2022

> The patch looks good to me... and really nice.
>
> I did some performance tests but unexpectedly saw no gain, and even some loss (vs. 2.0). Measure of throughput (subtree search with an indexed search returning 1 entry):
> So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener, it is equivalent.
> So in terms of establishing 10K connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.

@tbordaz thank you for quantifying the performance of the epoll patch. I'm obviously very disappointed with the result. I don't see any obvious issue with the implementation, but I will review the code again. If you, or anyone else, has any suggestion as to the cause, I would greatly appreciate your feedback.

I suspect the issue may be with epoll_pr_idle_pds and connections being transitioned to closure, but this is nothing more than a suspicion.

@tbordaz (Contributor) commented Mar 1, 2022

> > The patch looks good to me... and really nice.
> > I did some performance tests but unexpectedly saw no gain, and even some loss (vs. 2.0). Measure of throughput (subtree search with an indexed search returning 1 entry):
> > So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener, it is equivalent.
> > So in terms of establishing 10K connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.
>
> @tbordaz thank you for quantifying the performance of the epoll patch. I'm obviously very disappointed with the result. I don't see any obvious issue with the implementation, but I will review the code again. If you, or anyone else, has any suggestion as to the cause, I would greatly appreciate your feedback.
>
> I suspect the issue may be with epoll_pr_idle_pds and connections being transitioned to closure, but this is nothing more than a suspicion.

I have no idea what could be related to that loss. I see epoll_pr_idle_pds as not very different from setup_pr_read_pds in going through the active connections. My initial thought was that epoll was somehow transferring CPU work from userspace (scanning the poll array) to kernelspace (scanning the poll array to prepare the read-ready array), so I was initially expecting similar overall performance for epoll vs. poll. It is also disappointing to see a loss. It would require more tests to identify where time is spent.

The overall patch is (very) good to me. I do not know how to progress from here. Should we merge the patch with the ifdef, then eventually replace the ifdef with a config toggle?

Another option is to split the patch between accept_thread and the listener. My tests were showing (TBC) equivalent performance with accept_thread epoll vs. poll. So we could move accept_thread to epoll and #ifdef the listener part.

@progier389 (Contributor)

Just curious: @tbordaz, what kind of connections is the performance tool using? (short-term or persistent?)
That should help us understand which part of the code you are stressing (the accept() handling or the standard operation processing).

@tbordaz (Contributor) commented Mar 1, 2022

> Just curious: @tbordaz, what kind of connections is the performance tool using? (short-term or persistent?) That should help us understand which part of the code you are stressing (the accept() handling or the standard operation processing).

@progier389 here are the descriptions of the two tests.

Search throughput:
	N client threads each open a single connection, bind, and send an unlimited number of synchronous searches.
[01/Mar/2022:10:13:05.032462875 -0500] conn=45 fd=106 slot=106 connection from client to server
[01/Mar/2022:10:13:05.032717061 -0500] conn=45 op=0 BIND dn="cn=Directory manager" method=128 version=3
[01/Mar/2022:10:13:05.032777631 -0500] conn=45 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000033940 optime=0.000064496 etime=0.000097586 dn="cn=directory manager"
[01/Mar/2022:10:13:05.034030702 -0500] conn=46 fd=107 slot=107 connection from client to server
[01/Mar/2022:10:13:05.034186763 -0500] conn=46 op=0 BIND dn="cn=Directory manager" method=128 version=3
[01/Mar/2022:10:13:05.034367428 -0500] conn=46 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000028356 optime=0.000172710 etime=0.000198692 dn="cn=directory manager"
...
[01/Mar/2022:10:13:06.266385380 -0500] conn=45 op=1798 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06616)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266399608 -0500] conn=46 op=2415 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06620)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266435219 -0500] conn=45 op=1798 RESULT err=0 tag=101 nentries=1 wtime=0.000169448 optime=0.000049963 etime=0.000218492
[01/Mar/2022:10:13:06.266442879 -0500] conn=46 op=2415 RESULT err=0 tag=101 nentries=1 wtime=0.000148119 optime=0.000043390 etime=0.000190653
[01/Mar/2022:10:13:06.266723349 -0500] conn=46 op=2416 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06665)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266748568 -0500] conn=45 op=1799 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06666)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266811203 -0500] conn=45 op=1799 RESULT err=0 tag=101 nentries=1 wtime=0.000194276 optime=0.000062814 etime=0.000256238
[01/Mar/2022:10:13:06.266834464 -0500] conn=46 op=2416 RESULT err=0 tag=101 nentries=1 wtime=0.000170698 optime=0.000111508 etime=0.000280912
...
[01/Mar/2022:10:13:17.188528409 -0500] conn=45 op=24167 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user04034)" attrs="distinguishedName"
[01/Mar/2022:10:13:17.279252865 -0500] conn=46 op=25531 RESULT err=0 tag=101 nentries=1 wtime=0.000135211 optime=0.100236121 etime=0.100366851
[01/Mar/2022:10:13:17.288799914 -0500] conn=45 op=24167 RESULT err=0 tag=101 nentries=1 wtime=0.000122219 optime=0.100267877 etime=0.100385760
[01/Mar/2022:10:13:17.383157297 -0500] conn=46 op=-1 fd=107 closed error - B1
[01/Mar/2022:10:13:17.383215892 -0500] conn=45 op=-1 fd=106 closed error - B1

10K use case
	A single client opens 10K connections, binds, and sends a synchronous search on each connection.

[01/Mar/2022:10:26:18.954343720 -0500] conn=6270 fd=6276 slot=6276 connection from client to server
[01/Mar/2022:10:26:18.955272018 -0500] conn=6271 fd=6277 slot=6277 connection from client to server
[01/Mar/2022:10:26:18.956261209 -0500] conn=6271 op=0 BIND dn="cn=directory manager" method=128 version=3
[01/Mar/2022:10:26:18.956261361 -0500] conn=6270 op=0 BIND dn="cn=directory manager" method=128 version=3
[01/Mar/2022:10:26:18.956314499 -0500] conn=6271 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000009142 optime=0.000056602 etime=0.000065016 dn="cn=directory manager"
[01/Mar/2022:10:26:18.956325437 -0500] conn=6270 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000012630 optime=0.000067354 etime=0.000079061 dn="cn=directory manager"
..
[01/Mar/2022:10:26:51.291533809 -0500] conn=6271 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=scarter)" attrs=ALL
[01/Mar/2022:10:26:51.291585931 -0500] conn=6271 op=1 RESULT err=0 tag=101 nentries=0 wtime=0.000207695 optime=0.000053214 etime=0.000259783
[01/Mar/2022:10:26:51.296058736 -0500] conn=6272 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=scarter)" attrs=ALL
[01/Mar/2022:10:26:51.296110817 -0500] conn=6272 op=1 RESULT err=0 tag=101 nentries=0 wtime=0.000204513 optime=0.000053246 etime=0.000256633
..
[01/Mar/2022:10:27:54.795958317 -0500] conn=6270 op=3 UNBIND
[01/Mar/2022:10:27:54.795964540 -0500] conn=6270 op=3 fd=6276 closed error - U1
[01/Mar/2022:10:27:54.796241194 -0500] conn=6271 op=3 UNBIND
[01/Mar/2022:10:27:54.796247432 -0500] conn=6271 op=3 fd=6277 closed error - U1

@llile (Author) commented Mar 1, 2022 via email

@progier389 (Contributor)

So if I understand rightly, that is:
for i in 0 to N : Open Conn[i] then Bind on Conn[i]
Start of measuring search performance
for i in 0 to N : Send search on Conn[i] then Wait for result on Conn[i]
Collect Search performance result
for i in 0 to N : Unbind on Conn[i]

It is a bit surprising that epoll gets slower than poll in that case:
we should save the setup of the poll fd list and its copy into the kernel, which should be done twice per search.

That said, something puzzles me in this fix:
I would have expected connection_table_move_connection_out_of_active_list and connection_table_move_connection_on_to_active_list to be impacted (to stop/start polling on the fd). In the poll case this is done by walking the active list before polling (and setting up the poll fd list), but that is no longer the case.
I wonder if the max threads per connection limit is still enforced ...

@Firstyear (Contributor)

> So if I understand rightly, that is: for i in 0 to N : Open Conn[i] then Bind on Conn[i]; start measuring search performance; for i in 0 to N : Send search on Conn[i] then Wait for result on Conn[i]; collect the search performance result; for i in 0 to N : Unbind on Conn[i]

> It is a bit surprising that epoll gets slower than poll in that case: we should save the setup of the poll fd list and its copy into the kernel, which should be done twice per search.

Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

We are still treating connections here with epoll as though they are fds that need to be selected and set up into the table. epoll is a set of system calls, which means they are thread-safe. We only need to change the epollfd table when a connection is added or removed, and that does not necessitate a full rebuild of the table (via epoll_ctl DEL).

So the way to resolve this, is epoll_pr_idle_pds probably doesn't need to exist as a function. We also don't need an active/inactive list either.

https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1714

When we set this flag, at the callsite we should be calling the epoll disarm.

https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1732

At the locations we change c_gettingber and thread number, we call epoll arm.

Part of the value of epoll is that we don't need a single thread to manage the arming and disarming, we just need the connections to signal their state during a handling of that connection (from within the worker or reader thread).

This is probably the "biggest" change that will improve our performance: we are no longer bottlenecked on waiting for this thread to cycle around and get our fds ready.
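
As a rough illustration of that arm/disarm idea (helper names are hypothetical; epollfd comes from the patch's connection table, c_sd is assumed to be the connection's socket descriptor, and EPOLLONESHOT is one way to get the "disarm on delivery, re-arm from the worker" behaviour):

#include <sys/epoll.h>
#include <errno.h>

/* Hypothetical helpers: with EPOLLONESHOT the kernel disarms the fd after
 * each delivered event, so a worker thread only needs to re-arm the fd when
 * it has finished reading, and the listener never rebuilds the whole table. */
static void
conn_epoll_arm(int epollfd, Connection *c)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.ptr = c;
    if (epoll_ctl(epollfd, EPOLL_CTL_MOD, c->c_sd, &ev) == -1 && errno == ENOENT) {
        /* First time we see this fd: add it instead of modifying it. */
        epoll_ctl(epollfd, EPOLL_CTL_ADD, c->c_sd, &ev);
    }
}

static void
conn_epoll_disarm(int epollfd, Connection *c)
{
    /* Remove the fd from the interest list entirely, e.g. on connection close. */
    epoll_ctl(epollfd, EPOLL_CTL_DEL, c->c_sd, NULL);
}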

> That said, something puzzles me in this fix: I would have expected connection_table_move_connection_out_of_active_list and connection_table_move_connection_on_to_active_list to be impacted (to stop/start polling on the fd). In the poll case this is done by walking the active list before polling (and setting up the poll fd list), but that is no longer the case. I wonder if the max threads per connection limit is still enforced ...

It should still be enforced, from what I can read?

@llile (Author) commented Mar 2, 2022

@Firstyear, can you link this again or give a line number? For some reason all of the links are just taking me to the top of the diff. I want to make sure I am connecting the correct dots. Especially the following link:

> Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

That being said, I will whittle down epoll_pr_idle_pds and redistribute the arm/disarm logic. I had done this previously but since I appeared to be losing track of connections I backed out of that path. I discovered later the connection losses were unrelated.

I believe epoll_pr_idle_pds can be reduced to managing the idle timeouts
... if (c->c_idletimeout > 0 && (curtime - c->c_idlesince) >= c->c_idletimeout && NULL == c->c_ops) {

              ...  if (pagedresults_is_timedout_nolock(c)) {

which could be handled on a longer interval.

I may need some pointers on relocating the c_max_threads_per_conn control and persistent search control logic.
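
Sketched out, the reduced sweep described above might look something like this; the iteration and close helpers are hypothetical placeholders (not existing 389-ds functions), and only the two timeout checks are taken from the current code:

/* Hypothetical shape of a reduced epoll_pr_idle_pds: only idle-timeout and
 * paged-results-timeout handling remain, run on a longer interval; the
 * arm/disarm work has moved to the worker threads. */
static void
epoll_sweep_idle_connections(Connection_Table *ct, time_t curtime)
{
    for (Connection *c = sweep_first_connection(ct); c != NULL;
         c = sweep_next_connection(ct, c)) {
        if (pthread_mutex_trylock(&(c->c_mutex)) != 0) {
            continue; /* busy elsewhere; catch it on the next sweep */
        }
        if ((c->c_idletimeout > 0 &&
             (curtime - c->c_idlesince) >= c->c_idletimeout &&
             NULL == c->c_ops) ||
            pagedresults_is_timedout_nolock(c)) {
            close_idle_connection(c); /* placeholder for the real disconnect path */
        }
        pthread_mutex_unlock(&(c->c_mutex));
    }
}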

@Firstyear (Contributor) commented Mar 2, 2022

> @Firstyear, can you link this again or give a line number? For some reason all of the links are just taking me to the top of the diff. I want to make sure I am connecting the correct dots. Especially the following link:
>
> > Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

Once you click "load diff" on daemon.c it will take you to the correct line.

> That being said, I will whittle down epoll_pr_idle_pds and redistribute the arm/disarm logic. I had done this previously but since I appeared to be losing track of connections I backed out of that path. I discovered later the connection losses were unrelated.

Yep, that's a good idea. I'd start with one use case at a time, maybe just for now the "arm" path to rearm a connection, because that will yield the first and most obvious improvement to performance for this change, and you don't have any risk to "lose" connections during a conn close.

> I believe epoll_pr_idle_pds can be reduced to managing the idle timeouts ... if (c->c_idletimeout > 0 && (curtime - c->c_idlesince) >= c->c_idletimeout && NULL == c->c_ops) {
>
>               ...  if (pagedresults_is_timedout_nolock(c)) {
>
> which could be handled on a longer interval.

> I may need some pointers on relocating the c_max_threads_per_conn control and persistent search control logic.

Yep, that's completely okay, I think you're correct here.

EDIT: IMO once we have re-arming completed in the worker thread, and performance is equal or better, I'd suggest we merge this and then work on reducing pr_idle_pds over time, rather than doing it all at once.

@tbordaz (Contributor) commented Mar 2, 2022

> So if I understand rightly, that is: for i in 0 to N : Open Conn[i] then Bind on Conn[i]; start measuring search performance; for i in 0 to N : Send search on Conn[i] then Wait for result on Conn[i]; collect the search performance result; for i in 0 to N : Unbind on Conn[i]

Sorry, my description was confusing. We have two tools and two runs. The first tool is to monitor search throughput (searchrate). The second tool is to monitor how fast we establish 10K connections and then send a few (2) searches on each of the 10K connections.

The first tool:
spawns 50 LDAP clients. Each of them opens a connection to the server, authenticates as DM, then hammers the server with synchronous SRCH operations until the tool is stopped. So if a client sends 1000 SRCH operations, it opens one connection, binds once, and sends the 1000 SRCH operations on that connection.

The second tool:
a single LDAP client opens 10K connections and binds as DM on each of them. Then, for each of the 10K established connections, it sends a SRCH. Then it unbinds all connections.

@progier389 (Contributor)

Sorry about the noise about the active table (I was tricked by my memory: I wrongly thought the active table held the polled connections, while it actually holds the open connections).
About the c_gettingber and c_threadnumber test: I think you should handle that where these values are changed in connection.c,
by switching the POLLIN flag with epoll_ctl EPOLL_CTL_MOD:
set POLLIN if !c_gettingber && c_threadnumber < maxthreads.
(Note: in most cases the state transition is already in the code (when gettingber is changed or threadnumber is modified), and you will just have to call a function that sets/unsets the POLLIN flag.)
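
A minimal sketch of such a function, using EPOLL_CTL_MOD as described; the helper name is hypothetical, and the epoll fd and max-threads value are assumed to be passed in by the caller:

/* Hypothetical helper, called from connection.c wherever c_gettingber or
 * c_threadnumber changes: one EPOLL_CTL_MOD turns EPOLLIN on or off for
 * this connection depending on whether it can accept more work. */
static void
conn_epoll_update_interest(int epollfd, Connection *c, int maxthreads)
{
    struct epoll_event ev = {0};
    ev.data.ptr = c;
    if (!c->c_gettingber && c->c_threadnumber < maxthreads) {
        ev.events = EPOLLIN;   /* eligible: poll for the next request */
    } else {
        ev.events = 0;         /* mid-PDU or at max threads: stop polling */
    }
    epoll_ctl(epollfd, EPOLL_CTL_MOD, c->c_sd, &ev);
}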


num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events, connection_table_size, slapd_wakeup_timer);

// XXX This makes absolutely no sense, if this following slapi_log_err statement is removed or silenced
Author:

While cleaning up my code for further discussion, I discovered that removing this slapi_log_err or silencing it causes an order-of-magnitude drop in performance, from ~4k/s to ~0.6k/s in my testing. I'm completely befuddled by this; can someone point me in the right direction?

Contributor:

C is not a "linear" language. The code here isn't all executing one line at a time, your compiler is free to re-arrange things provided no data dependency conflicts, and your cpu will pipeline and run multiple instructions simultaneously, and in any order it wishes.

One rule in a compiler is about code hoisting/sinking where statements can occur. An if condition is a point in the code where the compiler may NOT re-arrange code and certain statement orders creating ordering. If you remove the slapi_log_err, then you create a scenario where the if condition is dead-code eliminated, meaning that other parts CAN be re-arranged. Without looking at the asm that's emitted and different between the two it's hard to know "why" this is causing such a huge performance drop.

I suspect if you replaced the if condition with:

__atomic_thread_fence(__ATOMIC_ACQ_REL)

This should cause the code on either side of this fence to remain in those locations. But that may also have its own issues. I've been trying to find a way to prevent code hoisting/sinking without atomics, but I'm not sure it's possible.

This is just one of those lovely examples of why C is not a good language in any way.

Anyway, another option could be, instead of the switch statement, to use an if like this:

if (num_poll > 0) {
} else if (num_poll < 0 ) {
}

That also may end up working nicer than the case switch, especially WRT optimisation, since the "common" case is the first one. You can also then move the if (conn_count != last_count) check inside that loop too.

Anyway, a few ideas for you :)
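
For reference, a sketch of that common-case-first shape, reusing the names from the hunk above and assuming the patch stores the Connection pointer in epoll_event.data (illustrative only, not the actual change):

num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events,
                      connection_table_size, slapd_wakeup_timer);
if (num_poll > 0) {
    /* Hot path first: dispatch every ready connection. */
    for (int i = 0; i < num_poll; i++) {
        Connection *c = (Connection *)ct_epoll_events[i].data.ptr;
        /* ... hand c to a worker / read the next operation ... */
    }
} else if (num_poll < 0) {
    if (errno != EINTR) {
        slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon",
                      "epoll_wait failed, errno %d\n", errno);
    }
}
/* num_poll == 0: timeout; fall through to housekeeping (idle checks, shutdown test). */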

Author:

@Firstyear I understand your comments about compiler optimizations, pipelining, and the out-of-order execution that can result; however, I cannot devise any scenario where reordering the code would result in this performance loss without violating a data dependency.

However, I have rewritten the suspect code segment in various ways, changing the semantics between if-then/case, nesting/de-nesting operations, etc. I can find no solution that preserves the performance without the call to slapi_log_err. The location of the call to slapi_log_err and the message are irrelevant: emitting any message from slapi_log_err at any point within the while (!g_get_shutdown()) loop corrects the performance regression.

This brings me to the conclusion that this must be a thread yield or thread scheduling problem that is resolved by the action of slapi_log_err local to the thread, which makes me think the thread local I/O is waking/scheduling the thread. Based on that I am beginning to suspect that moving my code away from the signalpipe semantic was not a correct decision. I am going to pull the signalpipe semantic back into the epoll code to confirm this suspicion.

Author:

Adding signalpipe to the connection table polling did nothing to improve the regression.

I'm at a complete loss.

Contributor:

If you suspect it's a thread yield issue, can you try pthread_yield() ?

@Firstyear (Contributor) commented Mar 21, 2022

@llile Hey there, I've started to look into this a bit. Looking at my decompiled outputs, I don't see any difference in the results besides the added log output - no weird rearrangement, etc. Even the same variables and offsets.

Can you provide me:

  • What compiler version you are using
  • What build settings you are using
  • a copy of build/.libs/ns-slapd (should be the ELF file) from the "with" and "without" log message variants so that I can decompile and help investigate?
  • A copy of libslapd.so (should also be in .libs) from both builds. Should also be an ELF

EDIT: Saying this, it's quite interesting that clang appears to emit a much better connection loop than gcc: gcc treats the g_get_shutdown() as "likely true", so all the shutdown components precede the epoll wait, whereas clang assumes it's "likely false", so it has the shutdown handling at the end, as per the C files. We might need to change this too:

while (!g_get_shutdown()) {

TO

while ( __builtin_expect(  !g_get_shutdown(), !0)) {
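
One possible way to express that hint without a compiler-specific branch at the loop (sketch only; both GCC and Clang accept __builtin_expect):

/* Sketch: a conventional likely()/unlikely() wrapper around __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

while (likely(!g_get_shutdown())) {
    /* ... main event loop ... */
}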

Author:

Hi @Firstyear, sorry for the delayed response.

The compiler is gcc-11.2.1-7.fc35.x86_64 and the configure settings are $ ./configure --exec-prefix=/usr --prefix=/ --datarootdir=/usr/share --datadir=/usr/share --libdir=/usr/lib64 --disable-asan --disable-tsan --enable-cmocka --with-systemd --with-selinux --with-openldap

What is the best method to provide you with the binaries?

Have you made any discoveries?

Contributor:

I think that Google Drive or some other drop box would work. Alternatively, you can tar them into a PR or similar on GitHub. Finally, if they compress to smaller than 10 MB you can email them to me at william.brown at suse.com :)

I did discover a bit in the reversing process, but nothing obvious that would cause the delay. That's why I'm curious to see your build outputs. I'll also try to build these myself today and compare once you send me your binaries.

@Firstyear (Contributor) commented Apr 7, 2022

I've had an initial look. My test results with: /opt/dirsrv/bin/ldclt -h localhost -p 3389 -N 10 -n 4 -e bindeach -e bindonly. I avoided search paths so that this would stress only anonymous (fully in memory) binds and connections only.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out log line: ldclt[22341]: Global average rate: 8829269.00/thr  (353170.75/sec), total: 35317075
Clang w_ log line: ldclt[840]: Global average rate: 9532904.00/thr  (381316.16/sec), total: 38131614
Clang w_out log line: ldclt[27872]: Global average rate: 9507196.00/thr  (380287.84/sec), total: 38028785

The two clang results are within +-0.2%, which means I perceive this to be noise. The GCC result, however, is +-3%, where the presence of the log line increases performance by 3%. This is still a trivial amount IMO, but it adds weight to the idea that there is a code generation issue with GCC related to the presence or absence of this log line, one that does not affect clang.

I'm not seeing the 20x kind of number that you are reporting though, so I'm wondering if there is something else going on in your test methodology, as my results show the gap as much closer. Regardless, there is a perceptible difference with the log line with GCC.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out log line: ldclt[22341]: Global average rate: 8829269.00/thr  (353170.75/sec), total: 35317075
GCC w trace line: Global average rate: 8508577.00/thr  (340343.09/sec), total: 34034307

Now if we change the log level to TRACE (so it emits nothing, but the call is still present), the performance decreases. This means that it is unlikely to be a code placement/alignment issue; rather, some behaviour in ldap/servers/slapd/log.c slapi_log_error, after the loglevel check, is influencing performance, likely affecting cache or memory behaviour. Reading that function we can see there are two likely causes: either the call to acquire the lock, or the call to issue the printf to the console. I attempted removing each of them in separate builds.

Removing either one caused the performance to rise to where we expect, which makes me think it is the issuance of the syscall to either printf or lock that is the cause of the performance rise.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out printf: ldclt[3793]: Global average rate: 9260484.00/thr  (370419.38/sec), total: 37041935
GCC w_out lock: ldclt[10884]: Global average rate: 9313178.00/thr  (372527.12/sec), total: 37252714

But if we remove the call to slapd_log_error_proc_internal then the performance increases to the rates we saw without the lock.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out log_err ldclt[25077]: Global average rate: 9211057.00/thr  (368442.28/sec), total: 36844228

This starts to make me think that there is a data dependence or code re-arrangement going on due to the call to slapi_log_error being present, which potentially impacts the behaviour of the branch predictor. Anyway, so I finally tried putting in the expect and removing the log line:

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_ log + expect: ldclt[32169]: Global average rate: 9226223.00/thr  (369048.91/sec), total: 36904892
GCC w_out log  + expect: ldclt[6841]: Global average rate: 9100628.00/thr  (364025.12/sec), total: 36402510

Adding the builtin expect line as suggested gets us up to the performance with the log line, but the log line still is increasing our performance.

Next I tried just simply adding a printf(""); instead of the log line.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_ log + expect: ldclt[32169]: Global average rate: 9226223.00/thr  (369048.91/sec), total: 36904892
GCC w_ printf + expect: ldclt[13949]: Global average rate: 9437884.00/thr  (377515.38/sec), total: 37751534

This is starting to get up near clang performance! So we know that it's not the log itself, but likely the act of having the syscall that is helping somehow. That's probably also why the interaction with epoll_pr_idle exists.

Trying with sched yield instead of printf/slapi_log_err():

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_ printf + expect: ldclt[13949]: Global average rate: 9437884.00/thr  (377515.38/sec), total: 37751534
GCC w_ yield + expect: ldclt[28141]: Global average rate: 9817390.00/thr  (392695.59/sec), total: 39269562

There we go! Now we are seeing some gains! The syscall to printf or lock() was likely allowing a yield point for the scheduler, so instead we can manually add one - and it's much cleaner :)

Now curiously, if we use clang with yield + expect it actually lowers performance. Again, likely just related to clang emitting better code.

So ultimately the changes needed to get GCC to clang's performance are:

diff --git a/ldap/servers/slapd/daemon.c b/ldap/servers/slapd/daemon.c
index 6f5d1f6ca..eb9f6f471 100644
--- a/ldap/servers/slapd/daemon.c
+++ b/ldap/servers/slapd/daemon.c
@@ -55,6 +55,7 @@
 #include <private/pprio.h>
 #include <ssl.h>
 #include "fe.h"
+#include <sched.h>

 #if defined(LDAP_IOCP)
 #define SLAPD_WAKEUP_TIMER 250
@@ -1156,12 +1157,22 @@ slapd_daemon(daemon_ports_t *ports)
 #endif

     /* The meat of the operation is in a loop on a call to select */
-    while (!g_get_shutdown()) {
+    while (
+#ifdef __clang__
+        !g_get_shutdown()
+#else
+        __builtin_expect(!g_get_shutdown(), !0)
+#endif
+    ) {
 #ifdef ENABLE_EPOLL

         // XXX emitting any message between here and #else, provides 20x performance improvement
         // XXX except as noted below near epoll_pr_idle
-        slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon", "epoll mystery\n");
+#ifndef __clang__
+        sched_yield();
+#endif

         // XXX There appears to be a slight performance improvment with if/else vs. case as firstyear predicted
         if ((num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events, connection_table_size, slapd_wakeup_timer)) > 0) {

@llile (Author) commented Mar 17, 2022

I have tidied up the code, noting the performance anomaly with XXX markers. I do not comprehend this issue.

@tbordaz could you possibly analyze this version versus 2.x?

@tbordaz (Contributor) commented Mar 18, 2022

@llile, I am really sorry for this late feedback. I will rerun the tests on the latest version and describe the details of the test and what I am observing. My test environment got cleared and I had to recreate it.

@tbordaz (Contributor) commented Mar 21, 2022

The performance (throughput and response time) of the recent patch is similar to that of the first patch. With epoll, throughput drops by 40% and response time increases by 60%. Using the Ping searchrate tool.

389-ds-base-2.0
ldapbench/bin/searchrate --hostname server --port 389 --bindDN "cn=Directory manager" --bindPassword password --baseDN dc=example,dc=com --scope sub --filter "(uid=user[0:10000%00000])" --attribute dn --numThreads 50 -I 5
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  150977.257        0.329        1.000        0.000   150977.218        0.329
  154457.337        0.323        1.000        0.000   152717.236        0.326
  153951.556        0.324        1.000        0.000   153128.693        0.325
  156661.714        0.318        1.000        0.000   154011.973        0.323
  158359.078        0.315        1.000        0.000   154881.280        0.322

389-ds-base-2.0 + epoll
ldapbench/bin/searchrate --hostname server --port 389 --bindDN "cn=Directory manager" --bindPassword password --baseDN dc=example,dc=com --scope sub --filter "(uid=user[0:10000%00000])" --attribute dn --numThreads 50 -I 5
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
   97315.923        0.512        1.000        0.000    97315.899        0.512
   97074.039        0.513        1.000        0.000    97194.966        0.513
   93545.599        0.534        1.000        0.000    95978.656        0.519
   94132.058        0.530        1.000        0.000    95516.981        0.522
   89023.958        0.560        1.000        0.000    94218.316        0.529

I installed a standalone instance with a very large dbcache and entry cache to store all DB entries and indexes. I will attach the database LDIF file.

@tbordaz (Contributor) commented Mar 21, 2022

users.ldif.final.gz

@tbordaz (Contributor) commented Aug 21, 2023

This PR is likely impacted by the multi-listener work, the removal of turbo mode, and dynamic thread management. The code has changed, but a new set of tests is also available.
We need to reevaluate this PR with those changes and see if epoll performance could be an alternative.

@Firstyear (Contributor)

100% agree, I think epoll should be looked at next and then the remaining bottlenecks found and cleaned up :)
