
Add epoll support for issue 5181 #5182

Status: Open. llile wants to merge 1 commit into base: main.

Conversation

@llile commented Feb 25, 2022

Pull request for issue 5181

Relates: #5181

@mreynolds389 (Contributor)

Great to see this back on the table! We did some initial testing with the old patch (which you mentioned had some bugs in it, incorrect use of sizeof, etc), and it was actually performing worse than our existing connection code. So we look forward to testing this latest patch. @tbordaz did that initial testing, and perhaps he can share with you the tooling we used so you can play around with it as well.

@llile (Author) commented Feb 25, 2022

Hi,

Can someone link this to issue 5181? I can't seem to figure out how to do that.

Thanks.

--Larry

@llile (Author) commented Feb 25, 2022

Sorry for the delay in getting the PR back online, but all the paperwork and approvals are done.

I look forward to any feedback from you all and can hopefully help address any issues that are discovered. If epoll isn't providing any performance or architectural benefit, that is OK; at least the option was explored.


@mreynolds389 (Contributor)

> Sorry for the delay in getting the PR back online, but all the paperwork and approvals are done.
>
> I look forward to any feedback from you all and can hopefully help address any issues that are discovered. If epoll isn't providing any performance or architectural benefit, that is OK; at least the option was explored.

Moving to epoll is the right thing to do; it should technically be able to service things faster - it just might require more code changes :-)

@Firstyear (Contributor) left a comment

Mostly looks really good. A few small details, and one larger change in here, but it looks really good. I think the fact that it's behind an ifdef is excellent, because that means we should consider merging it sooner rather than later so that we can start to test this more extensively, and we have a fast rollback path if something goes sideways. Great work!

ldap/servers/slapd/daemon.c: four review threads (resolved; one outdated)
/* we try to acquire the connection mutex, if it is already
* acquired by another thread, don't wait
*/
if (pthread_mutex_trylock(&(c->c_mutex)) == EBUSY) {
Contributor:

Could consider doing rc = trylock, then checking rc != 0, and if it's an unexpected code, logging it (unexpected meaning anything other than EBUSY; see man 3 pthread_mutex_trylock).
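
A minimal sketch of that pattern, for illustration only (the helper name is hypothetical and the subsystem string is illustrative; it assumes the caller skips the connection when the lock is not acquired):

/* Hypothetical helper sketching the suggestion above: try the connection
 * mutex without blocking, and log any return code that is neither 0
 * (acquired) nor EBUSY (held by another thread). */
static int
try_lock_connection(Connection *c)
{
    int rc = pthread_mutex_trylock(&(c->c_mutex));
    if (rc != 0 && rc != EBUSY) {
        /* Unexpected code (e.g. EINVAL/EAGAIN); see man 3 pthread_mutex_trylock. */
        slapi_log_err(SLAPI_LOG_ERR, "try_lock_connection",
                      "pthread_mutex_trylock returned unexpected code %d\n", rc);
    }
    return rc; /* 0 means the caller now holds c->c_mutex and must unlock it */
}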

Author:

Updated per the suggestion; please review the new update.

ldap/servers/slapd/daemon.c (resolved review thread)
@@ -4792,6 +4822,25 @@ config_set_maxdescriptors(const char *attrname, char *value, char *errorbuf, int
}
}

#ifdef LINUX
#ifdef ENABLE_EPOLL
maxVal = (int64_t)slapdFrontendConfig->maxuserwatches;
Contributor:

I don't think we should add an extra setting here. maxuserwatches is always going to have to be at least equal to or greater than the conntable size, meaning that we already have a setting for this as a limit. So I'd say strip out all the maxuserwatches stuff and check that the conntable size is <= fs.maxuserwatches, and that's it. Then, if conntable > fs.maxuserwatches, clamp conntable to fs.maxuserwatches.

In general "adding more tunables" is not a good use of time/effort since most tunables are never altered or touched.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the assessment about creating excess tunables. I didn't actually create a new tunable. max_user_watches is stored in slapdFrontendConfig so that it can be accessed from multiple locations. At all of the locations where related tunables are configured, they are clamped to the smaller of the specific tunable or max_user_watches. This seemed the best solution given the number of tunables that clamp the connection table size.

I was concerned that without adding the other clamps it might be confusing for administrators to be denied a tunable change with little or no obvious explanation.

If it is preferable, I could remove all clamps except for the one in config_set_conntablesize.
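
For illustration, a minimal sketch of the kind of clamp being discussed, assuming the limit is read from /proc/sys/fs/epoll/max_user_watches; the helper name and log wording are hypothetical, not part of the patch:

#include <stdio.h>
#include <inttypes.h>

/* Hypothetical helper: clamp a requested size to the kernel's epoll watch
 * limit and log why, so the administrator sees the reason for the clamp. */
static int64_t
clamp_to_max_user_watches(const char *attrname, int64_t requested)
{
    int64_t max_watches = 0;
    FILE *fp = fopen("/proc/sys/fs/epoll/max_user_watches", "r");
    if (fp) {
        if (fscanf(fp, "%" SCNd64, &max_watches) != 1) {
            max_watches = 0;
        }
        fclose(fp);
    }
    if (max_watches > 0 && requested > max_watches) {
        slapi_log_err(SLAPI_LOG_ERR, "clamp_to_max_user_watches",
                      "%s (%" PRId64 ") exceeds fs.epoll.max_user_watches (%" PRId64 "), clamping\n",
                      attrname, requested, max_watches);
        return max_watches;
    }
    return requested;
}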

Contributor:

IMHO those checks look valid to me. Before setting descriptors, conntable-size, or reserved descriptors, it checks that the set values are aligned with max_user_watches; that looks safe and verbose enough to help the admin.

Contributor:

Okay, in that case please remove the tunable for maxuserwatches, and just check its value against the conntable size :)

@tbordaz (Contributor) commented Feb 28, 2022

The patch looks good to me... and really nice.

I did some performance tests but unexpectedly saw no gain, and even some loss (vs. 2.0).
Measure of throughput (subtree search with an indexed search returning 1 entry):

**With #5181 patch**
./perf_test.sh 55 50 (listener 99% CPU)
(20% slower than regular)
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  113174.013        0.483        1.000        0.000   113173.981        0.483
  118207.979        0.464        1.000        0.000   115690.977        0.473
  114818.793        0.477        1.000        0.000   115400.277        0.474
  114730.341        0.478        1.000        0.000   115232.784        0.475

**389-ds-2.0**
./perf_test.sh 55 50
        accept_thread => 0%   (new cnx)
        slapd_daemon  => 99.2%  (new op)
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  143122.319        0.379        1.000        0.000   143122.281        0.379
  147573.311        0.371        1.000        0.000   145347.574        0.375
  146279.148        0.375        1.000        0.000   145658.129        0.375

So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener, it is equivalent.

Measure of 10K conn

**epoll.5181**
10000 connections created
2 search on 10000 connections, complete.
Unbind complete for 10000 connections

real    1m43.165s
user    0m3.541s
sys     0m2.804s

**389-ds2.0**
10000 connections created
2 search on 10000 connections, complete.
Unbind complete for 10000 connections
        accept_thread 9392 = 4%
        slapd_daemon 9351 =  0%

        accept_thread 9392 = 0%
        slapd_daemon 9351 = 99.3%

real    1m3.962s
user    0m3.384s
sys     0m2.438s

So in terms of establishing 10K connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.

@llile (Author) commented Mar 1, 2022

> The patch looks good to me... and really nice.
>
> I did some performance tests but unexpectedly saw no gain, and even some loss (vs. 2.0). Measure of throughput (subtree search with an indexed search returning 1 entry):
> So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener, it is equivalent.
> So in terms of establishing 10K connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.

@tbordaz thank you for quantifying the performance of the epoll patch. I'm obviously very disappointed with the result. I don't see any obvious issue with the implementation, but I will review the code again. If you, or anyone else, has any suggestion as to the cause, I would greatly appreciate your feedback.

I suspect the issue may be with epoll_pr_idle_pds and connections being transitioned to closure, but this is nothing more than a suspicion.

@tbordaz (Contributor) commented Mar 1, 2022

> > The patch looks good to me... and really nice.
> > I did some performance tests but unexpectedly saw no gain, and even some loss (vs. 2.0). Measure of throughput (subtree search with an indexed search returning 1 entry):
> > So in terms of throughput, the epoll version is ~20% slower; in terms of CPU consumption of the listener, it is equivalent.
> > So in terms of establishing 10K connections, epoll is ~60% slower than 2.0 and equivalent in terms of CPU consumption.
>
> @tbordaz thank you for quantifying the performance of the epoll patch. I'm obviously very disappointed with the result. I don't see any obvious issue with the implementation, but I will review the code again. If you, or anyone else, has any suggestion as to the cause, I would greatly appreciate your feedback.
>
> I suspect the issue may be with epoll_pr_idle_pds and connections being transitioned to closure, but this is nothing more than a suspicion.

I have no idea what could be related to that loss. I see epoll_pr_idle_pds as not very different from setup_pr_read_pds in going through the active connections. My initial thought was that epoll was somehow transferring CPU work from userspace (scanning the poll array) to kernelspace (scanning the poll array to prepare the read-ready array), so I was initially expecting similar overall performance for epoll vs. poll. It is also disappointing to see a loss. It would require more tests to identify where time is spent.

The overall patch is (very) good to me. I do not know how to progress from here. Should we merge the patch with the ifdef, then eventually replace the ifdef with a config toggle?

Another option is to split the patch between accept_thread and the listener. My tests were showing (TBC) equivalent performance with accept_thread epoll vs. poll. So we could move accept_thread to epoll and #ifdef the listener part.

@progier389 (Contributor)

Just curious: @tbordaz, what kind of connections is the performance tool using? (short-term or persistent?)
That should help us understand which part of the code you are stressing (the accept() handling or the standard operation processing).

@tbordaz (Contributor) commented Mar 1, 2022

> Just curious: @tbordaz, what kind of connections is the performance tool using? (short-term or persistent?) That should help us understand which part of the code you are stressing (the accept() handling or the standard operation processing).

@progier389 here are the descriptions of the two tests.

Search throughput:
	N client threads each open a single connection, bind, and send an unlimited number of synchronous searches.
[01/Mar/2022:10:13:05.032462875 -0500] conn=45 fd=106 slot=106 connection from client to server
[01/Mar/2022:10:13:05.032717061 -0500] conn=45 op=0 BIND dn="cn=Directory manager" method=128 version=3
[01/Mar/2022:10:13:05.032777631 -0500] conn=45 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000033940 optime=0.000064496 etime=0.000097586 dn="cn=directory manager"
[01/Mar/2022:10:13:05.034030702 -0500] conn=46 fd=107 slot=107 connection from client to server
[01/Mar/2022:10:13:05.034186763 -0500] conn=46 op=0 BIND dn="cn=Directory manager" method=128 version=3
[01/Mar/2022:10:13:05.034367428 -0500] conn=46 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000028356 optime=0.000172710 etime=0.000198692 dn="cn=directory manager"
...
[01/Mar/2022:10:13:06.266385380 -0500] conn=45 op=1798 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06616)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266399608 -0500] conn=46 op=2415 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06620)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266435219 -0500] conn=45 op=1798 RESULT err=0 tag=101 nentries=1 wtime=0.000169448 optime=0.000049963 etime=0.000218492
[01/Mar/2022:10:13:06.266442879 -0500] conn=46 op=2415 RESULT err=0 tag=101 nentries=1 wtime=0.000148119 optime=0.000043390 etime=0.000190653
[01/Mar/2022:10:13:06.266723349 -0500] conn=46 op=2416 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06665)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266748568 -0500] conn=45 op=1799 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user06666)" attrs="distinguishedName"
[01/Mar/2022:10:13:06.266811203 -0500] conn=45 op=1799 RESULT err=0 tag=101 nentries=1 wtime=0.000194276 optime=0.000062814 etime=0.000256238
[01/Mar/2022:10:13:06.266834464 -0500] conn=46 op=2416 RESULT err=0 tag=101 nentries=1 wtime=0.000170698 optime=0.000111508 etime=0.000280912
...
[01/Mar/2022:10:13:17.188528409 -0500] conn=45 op=24167 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user04034)" attrs="distinguishedName"
[01/Mar/2022:10:13:17.279252865 -0500] conn=46 op=25531 RESULT err=0 tag=101 nentries=1 wtime=0.000135211 optime=0.100236121 etime=0.100366851
[01/Mar/2022:10:13:17.288799914 -0500] conn=45 op=24167 RESULT err=0 tag=101 nentries=1 wtime=0.000122219 optime=0.100267877 etime=0.100385760
[01/Mar/2022:10:13:17.383157297 -0500] conn=46 op=-1 fd=107 closed error - B1
[01/Mar/2022:10:13:17.383215892 -0500] conn=45 op=-1 fd=106 closed error - B1

10K use case
	A single client opens 10K connections, binds, and sends a synchronous search on each connection.

[01/Mar/2022:10:26:18.954343720 -0500] conn=6270 fd=6276 slot=6276 connection from client to server
[01/Mar/2022:10:26:18.955272018 -0500] conn=6271 fd=6277 slot=6277 connection from client to server
[01/Mar/2022:10:26:18.956261209 -0500] conn=6271 op=0 BIND dn="cn=directory manager" method=128 version=3
[01/Mar/2022:10:26:18.956261361 -0500] conn=6270 op=0 BIND dn="cn=directory manager" method=128 version=3
[01/Mar/2022:10:26:18.956314499 -0500] conn=6271 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000009142 optime=0.000056602 etime=0.000065016 dn="cn=directory manager"
[01/Mar/2022:10:26:18.956325437 -0500] conn=6270 op=0 RESULT err=0 tag=97 nentries=0 wtime=0.000012630 optime=0.000067354 etime=0.000079061 dn="cn=directory manager"
..
[01/Mar/2022:10:26:51.291533809 -0500] conn=6271 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=scarter)" attrs=ALL
[01/Mar/2022:10:26:51.291585931 -0500] conn=6271 op=1 RESULT err=0 tag=101 nentries=0 wtime=0.000207695 optime=0.000053214 etime=0.000259783
[01/Mar/2022:10:26:51.296058736 -0500] conn=6272 op=1 SRCH base="dc=example,dc=com" scope=2 filter="(uid=scarter)" attrs=ALL
[01/Mar/2022:10:26:51.296110817 -0500] conn=6272 op=1 RESULT err=0 tag=101 nentries=0 wtime=0.000204513 optime=0.000053246 etime=0.000256633
..
[01/Mar/2022:10:27:54.795958317 -0500] conn=6270 op=3 UNBIND
[01/Mar/2022:10:27:54.795964540 -0500] conn=6270 op=3 fd=6276 closed error - U1
[01/Mar/2022:10:27:54.796241194 -0500] conn=6271 op=3 UNBIND
[01/Mar/2022:10:27:54.796247432 -0500] conn=6271 op=3 fd=6277 closed error - U1

@llile (Author) commented Mar 1, 2022 via email

@progier389 (Contributor)

So if I understand rightly, that is:
for i in 0 to N : Open Conn[i] then Bind on Conn[i]
Start of measuring search performance
for i in 0 to N : Send search on Conn[i] then Wait for result on Conn[i]
Collect Search performance result
for i in 0 to N : Unbind on Conn[i]

It is a bit surprising that epoll gets slower than poll in that case:
we should save the setup of the poll fd list and its copy into the kernel, which should be done twice per search.

That said, something puzzles me in this fix:
I would have expected connection_table_move_connection_out_of_active_list and connection_table_move_connection_on_to_active_list to be impacted (to stop/start polling on the fd). In the poll case this is done by walking the active list before polling (and setting up the poll fd list), but that is no longer the case.
I wonder if the max threads per connection limit is still enforced ...

@Firstyear (Contributor)

> So if I understand rightly, that is: for i in 0 to N : Open Conn[i] then Bind on Conn[i]; start measuring search performance; for i in 0 to N : Send search on Conn[i] then Wait for result on Conn[i]; collect the search performance result; for i in 0 to N : Unbind on Conn[i]

> It is a bit surprising that epoll gets slower than poll in that case: we should save the setup of the poll fd list and its copy into the kernel, which should be done twice per search.

Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

We are still treating connections here with epoll as though they are fds that need to be selected and set up into the table. epoll is a set of system calls, which means they are thread-safe. We only need to change the epollfd table when a connection is added or removed, and that does not necessitate a full rebuild of the table (via epoll_ctl DEL).

So the way to resolve this, is epoll_pr_idle_pds probably doesn't need to exist as a function. We also don't need an active/inactive list either.

https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1714

When we set this flag, at the callsite we should be calling the epoll disarm.

https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1732

At the locations we change c_gettingber and thread number, we call epoll arm.

Part of the value of epoll is that we don't need a single thread to manage the arming and disarming, we just need the connections to signal their state during a handling of that connection (from within the worker or reader thread).

This is probably the "biggest" change that will improve our performance: we are no longer bottlenecked on waiting for this thread to cycle around and get our fds ready.
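
As a rough illustration of that arm/disarm idea (helper names are hypothetical; epollfd comes from the patch's connection table, c_sd is assumed to be the connection's socket descriptor, and EPOLLONESHOT is one way to get the "disarm on delivery, re-arm from the worker" behaviour):

#include <sys/epoll.h>
#include <errno.h>

/* Hypothetical helpers: with EPOLLONESHOT the kernel disarms the fd after
 * each delivered event, so a worker thread only needs to re-arm the fd when
 * it has finished reading, and the listener never rebuilds the whole table. */
static void
conn_epoll_arm(int epollfd, Connection *c)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.ptr = c;
    if (epoll_ctl(epollfd, EPOLL_CTL_MOD, c->c_sd, &ev) == -1 && errno == ENOENT) {
        /* First time we see this fd: add it instead of modifying it. */
        epoll_ctl(epollfd, EPOLL_CTL_ADD, c->c_sd, &ev);
    }
}

static void
conn_epoll_disarm(int epollfd, Connection *c)
{
    /* Remove the fd from the interest list entirely, e.g. on connection close. */
    epoll_ctl(epollfd, EPOLL_CTL_DEL, c->c_sd, NULL);
}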

> That said, something puzzles me in this fix: I would have expected connection_table_move_connection_out_of_active_list and connection_table_move_connection_on_to_active_list to be impacted (to stop/start polling on the fd). In the poll case this is done by walking the active list before polling (and setting up the poll fd list), but that is no longer the case. I wonder if the max threads per connection limit is still enforced ...

It should still be enforced, from what I can read?

@llile (Author) commented Mar 2, 2022

@Firstyear, can you link this again or give a line number? For some reason all of the links are just taking me to the top of the diff. I want to make sure I am connecting the correct dots. Especially the following link:

> Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

That being said, I will whittle down epoll_pr_idle_pds and redistribute the arm/disarm logic. I had done this previously but since I appeared to be losing track of connections I backed out of that path. I discovered later the connection losses were unrelated.

I believe epoll_pr_idle_pds can be reduced to managing the idle timeouts
... if (c->c_idletimeout > 0 && (curtime - c->c_idlesince) >= c->c_idletimeout && NULL == c->c_ops) {

              ...  if (pagedresults_is_timedout_nolock(c)) {

which could be handled on a longer interval.

I may need some pointers on relocating the c_max_threads_per_conn control and persistent search control logic.
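
Sketched out, the reduced sweep described above might look something like this; the iteration and close helpers are hypothetical placeholders (not existing 389-ds functions), and only the two timeout checks are taken from the current code:

/* Hypothetical shape of a reduced epoll_pr_idle_pds: only idle-timeout and
 * paged-results-timeout handling remain, run on a longer interval; the
 * arm/disarm work has moved to the worker threads. */
static void
epoll_sweep_idle_connections(Connection_Table *ct, time_t curtime)
{
    for (Connection *c = sweep_first_connection(ct); c != NULL;
         c = sweep_next_connection(ct, c)) {
        if (pthread_mutex_trylock(&(c->c_mutex)) != 0) {
            continue; /* busy elsewhere; catch it on the next sweep */
        }
        if ((c->c_idletimeout > 0 &&
             (curtime - c->c_idlesince) >= c->c_idletimeout &&
             NULL == c->c_ops) ||
            pagedresults_is_timedout_nolock(c)) {
            close_idle_connection(c); /* placeholder for the real disconnect path */
        }
        pthread_mutex_unlock(&(c->c_mutex));
    }
}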

@Firstyear (Contributor) commented Mar 2, 2022

> @Firstyear, can you link this again or give a line number? For some reason all of the links are just taking me to the top of the diff. I want to make sure I am connecting the correct dots. Especially the following link:
>
> > Well, I think the issue is here. https://github.com/389ds/389-ds-base/pull/5182/files#diff-d7be8bf18c5d95a1bec17d52c8c2f74e9273250857e64b130bf3718f0027c9a6R1685

Once you click "load diff" on daemon.c it will take you to the correct line.

> That being said, I will whittle down epoll_pr_idle_pds and redistribute the arm/disarm logic. I had done this previously but since I appeared to be losing track of connections I backed out of that path. I discovered later the connection losses were unrelated.

Yep, that's a good idea. I'd start with one use case at a time, maybe just for now the "arm" path to rearm a connection, because that will yield the first and most obvious improvement to performance for this change, and you don't have any risk to "lose" connections during a conn close.

> I believe epoll_pr_idle_pds can be reduced to managing the idle timeouts ... if (c->c_idletimeout > 0 && (curtime - c->c_idlesince) >= c->c_idletimeout && NULL == c->c_ops) {
>
>               ...  if (pagedresults_is_timedout_nolock(c)) {
>
> which could be handled on a longer interval.

> I may need some pointers on relocating the c_max_threads_per_conn control and persistent search control logic.

Yep, that's completely okay, I think you're correct here.

EDIT: IMO once we have re-arming completed in the worker thread, and performance is equal or better, I'd suggest we merge this and then work on reducing pr_idle_pds over time, rather than doing it all at once.

@tbordaz (Contributor) commented Mar 2, 2022

> So if I understand rightly, that is: for i in 0 to N : Open Conn[i] then Bind on Conn[i]; start measuring search performance; for i in 0 to N : Send search on Conn[i] then Wait for result on Conn[i]; collect the search performance result; for i in 0 to N : Unbind on Conn[i]

Sorry, my description was confusing. We have two tools and two runs. The first tool is to monitor search throughput (searchrate). The second tool is to monitor how fast we establish 10K connections and then send a few (2) searches on each of the 10K connections.

The first tool:
spawns 50 LDAP clients. Each of them opens a connection to the server, authenticates as DM, then hammers the server with synchronous SRCH operations until the tool is stopped. So if a client sends 1000 SRCH operations, it opens one connection, binds once, and sends the 1000 SRCH operations on that connection.

The second tool:
a single LDAP client opens 10K connections and binds as DM on each of them. Then, for each of the 10K established connections, it sends a SRCH. Then it unbinds all connections.

@progier389 (Contributor)

Sorry about the noise about the active table (I was tricked by my memory: I wrongly thought the active table held the polled connections, while it actually holds the open connections).
About the c_gettingber and c_threadnumber test: I think you should handle that where these values are changed in connection.c,
by switching the POLLIN flag with epoll_ctl EPOLL_CTL_MOD:
set POLLIN if !c_gettingber && c_threadnumber < maxthreads.
(Note: in most cases the state transition is already in the code (when gettingber is changed or threadnumber is modified), and you will just have to call a function that sets/unsets the POLLIN flag.)
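
A minimal sketch of such a function, using EPOLL_CTL_MOD as described; the helper name is hypothetical, and the epoll fd and max-threads value are assumed to be passed in by the caller:

/* Hypothetical helper, called from connection.c wherever c_gettingber or
 * c_threadnumber changes: one EPOLL_CTL_MOD turns EPOLLIN on or off for
 * this connection depending on whether it can accept more work. */
static void
conn_epoll_update_interest(int epollfd, Connection *c, int maxthreads)
{
    struct epoll_event ev = {0};
    ev.data.ptr = c;
    if (!c->c_gettingber && c->c_threadnumber < maxthreads) {
        ev.events = EPOLLIN;   /* eligible: poll for the next request */
    } else {
        ev.events = 0;         /* mid-PDU or at max threads: stop polling */
    }
    epoll_ctl(epollfd, EPOLL_CTL_MOD, c->c_sd, &ev);
}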


num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events, connection_table_size, slapd_wakeup_timer);

// XXX This makes absolutely no sense, if this following slapi_log_err statement is removed or silenced
Author:

While cleaning up my code for further discussion, I discovered that removing this slapi_log_err or silencing it causes an order-of-magnitude drop in performance, from ~4k/s to ~0.6k/s in my testing. I'm completely befuddled by this; can someone point me in the right direction?

Contributor:

C is not a "linear" language. The code here isn't all executing one line at a time, your compiler is free to re-arrange things provided no data dependency conflicts, and your cpu will pipeline and run multiple instructions simultaneously, and in any order it wishes.

One rule in a compiler is about code hoisting/sinking where statements can occur. An if condition is a point in the code where the compiler may NOT re-arrange code and certain statement orders creating ordering. If you remove the slapi_log_err, then you create a scenario where the if condition is dead-code eliminated, meaning that other parts CAN be re-arranged. Without looking at the asm that's emitted and different between the two it's hard to know "why" this is causing such a huge performance drop.

I suspect if you replaced the if condition with:

__atomic_thread_fence(__ATOMIC_ACQ_REL)

This should cause the code on either side of this fence to remain in those locations. But that may also have its own issues. I've been trying to find a way to prevent code hoisting/sinking without atomics, but I'm not sure it's possible.

This is just one of those lovely examples of why C is not a good language in any way.

Anyway, another option could be, instead of the switch statement, to use an if like this:

if (num_poll > 0) {
} else if (num_poll < 0 ) {
}

That also may end up working nicer than the case switch, especially WRT optimisation, since the "common" case is the first one. You can also then move the if (conn_count != last_count) check inside that loop too.

Anyway, a few ideas for you :)
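
For reference, a sketch of that common-case-first shape, reusing the names from the hunk above and assuming the patch stores the Connection pointer in epoll_event.data (illustrative only, not the actual change):

num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events,
                      connection_table_size, slapd_wakeup_timer);
if (num_poll > 0) {
    /* Hot path first: dispatch every ready connection. */
    for (int i = 0; i < num_poll; i++) {
        Connection *c = (Connection *)ct_epoll_events[i].data.ptr;
        /* ... hand c to a worker / read the next operation ... */
    }
} else if (num_poll < 0) {
    if (errno != EINTR) {
        slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon",
                      "epoll_wait failed, errno %d\n", errno);
    }
}
/* num_poll == 0: timeout; fall through to housekeeping (idle checks, shutdown test). */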

Author:

@Firstyear I understand your comments about compiler optimizations, pipelining, and the out-of-order execution that can result; however, I cannot devise any scenario where reordering the code would result in this performance loss without violating a data dependency.

However, I have rewritten the suspect code segment in various ways, changing the semantics between if-then/case, nesting/de-nesting operations, etc. I can find no solution that preserves the performance without the call to slapi_log_err. The location of the call to slapi_log_err and the message are irrelevant: emitting any message from slapi_log_err at any point within the while (!g_get_shutdown()) loop corrects the performance regression.

This brings me to the conclusion that this must be a thread yield or thread scheduling problem that is resolved by the action of slapi_log_err local to the thread, which makes me think the thread local I/O is waking/scheduling the thread. Based on that I am beginning to suspect that moving my code away from the signalpipe semantic was not a correct decision. I am going to pull the signalpipe semantic back into the epoll code to confirm this suspicion.

Author:

Adding signalpipe to the connection table polling did nothing to improve the regression.

I'm at a complete loss.

Contributor:

If you suspect it's a thread yield issue, can you try pthread_yield() ?

@Firstyear (Contributor) commented Mar 21, 2022

@llile Hey there, I've started to look into this a bit. Looking at my decompiled outputs, I don't see any difference in the results besides the added log output - no weird rearrangement, etc. Even the same variables and offsets.

Can you provide me:

  • What compiler version you are using
  • What build settings you are using
  • a copy of build/.libs/ns-slapd (should be the ELF file) from the "with" and "without" log message variants so that I can decompile and help investigate?
  • A copy of libslapd.so (should also be in .libs) from both builds. Should also be an ELF

EDIT: Saying this, it's quite interesting that clang appears to emit a much better connection loop than gcc: gcc treats the g_get_shutdown() as "likely true", so all the shutdown components precede the epoll wait, whereas clang assumes it's "likely false", so it has the shutdown handling at the end, as per the C files. We might need to change this too:

while (!g_get_shutdown()) {

TO

while ( __builtin_expect(  !g_get_shutdown(), !0)) {
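
One possible way to express that hint without a compiler-specific branch at the loop (sketch only; both GCC and Clang accept __builtin_expect):

/* Sketch: a conventional likely()/unlikely() wrapper around __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

while (likely(!g_get_shutdown())) {
    /* ... main event loop ... */
}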

Author:

Hi @Firstyear, sorry for the delayed response.

The compiler is gcc-11.2.1-7.fc35.x86_64 and the configure settings are $ ./configure --exec-prefix=/usr --prefix=/ --datarootdir=/usr/share --datadir=/usr/share --libdir=/usr/lib64 --disable-asan --disable-tsan --enable-cmocka --with-systemd --with-selinux --with-openldap

What is the best method to provide you with the binaries?

Have you made any discoveries?

Contributor:

I think that Google Drive or some other drop box would work. Alternatively, you can tar them into a PR or similar on GitHub. Finally, if they compress to smaller than 10 MB you can email them to me at william.brown at suse.com :)

I did discover a bit in the reversing process, but nothing obvious that would cause the delay. That's why I'm curious to see your build outputs. I'll also try to build these myself today and compare once you send me your binaries.

@Firstyear (Contributor) commented Apr 7, 2022

I've had an initial look. My test results with: /opt/dirsrv/bin/ldclt -h localhost -p 3389 -N 10 -n 4 -e bindeach -e bindonly. I avoided search paths so that this would stress only anonymous (fully in memory) binds and connections only.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out log line: ldclt[22341]: Global average rate: 8829269.00/thr  (353170.75/sec), total: 35317075
Clang w_ log line: ldclt[840]: Global average rate: 9532904.00/thr  (381316.16/sec), total: 38131614
Clang w_out log line: ldclt[27872]: Global average rate: 9507196.00/thr  (380287.84/sec), total: 38028785

The two clang results are within +-0.2%, which means I perceive this to be noise. The GCC result, however, is +-3%, where the presence of the log line increases performance by 3%. This is still a trivial amount IMO, but it adds weight to the idea that there is a code generation issue with GCC related to the presence or absence of this log line, one that does not affect clang.

I'm not seeing the 20x kind of number that you are reporting though, so I'm wondering if there is something else going on in your test methodology, as my results show the gap as much closer. Regardless, there is a perceptible difference with the log line with GCC.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out log line: ldclt[22341]: Global average rate: 8829269.00/thr  (353170.75/sec), total: 35317075
GCC w trace line: Global average rate: 8508577.00/thr  (340343.09/sec), total: 34034307

Now if we change the log level to TRACE (so it emits nothing, but the call is still present), the performance decreases. This means that it is unlikely to be a code placement/alignment issue; rather, some behaviour in ldap/servers/slapd/log.c slapi_log_error, after the loglevel check, is influencing performance, likely affecting cache or memory behaviour. Reading that function we can see there are two likely causes: either the call to acquire the lock, or the call to issue the printf to the console. I attempted removing each of them in separate builds.

Removing either one caused the performance to rise to where we expect, which makes me think it is the issuance of the syscall to either printf or lock that is the cause of the performance rise.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out printf: ldclt[3793]: Global average rate: 9260484.00/thr  (370419.38/sec), total: 37041935
GCC w_out lock: ldclt[10884]: Global average rate: 9313178.00/thr  (372527.12/sec), total: 37252714

But if we remove the call to slapd_log_error_proc_internal then the performance increases to the rates we saw without the lock.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_out log_err ldclt[25077]: Global average rate: 9211057.00/thr  (368442.28/sec), total: 36844228

This starts to make me think that there is a data dependence or code re-arrangement going on due to the call to slapi_log_error being present, which potentially impacts the behaviour of the branch predictor. Anyway, so I finally tried putting in the expect and removing the log line:

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_ log + expect: ldclt[32169]: Global average rate: 9226223.00/thr  (369048.91/sec), total: 36904892
GCC w_out log  + expect: ldclt[6841]: Global average rate: 9100628.00/thr  (364025.12/sec), total: 36402510

Adding the builtin expect line as suggested gets us up to the performance with the log line, but the log line still is increasing our performance.

Next I tried just simply adding a printf(""); instead of the log line.

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_ log + expect: ldclt[32169]: Global average rate: 9226223.00/thr  (369048.91/sec), total: 36904892
GCC w_ printf + expect: ldclt[13949]: Global average rate: 9437884.00/thr  (377515.38/sec), total: 37751534

This is starting to get up near clang performance! So we know that it's not the log itself, but likely the act of having the syscall that is helping somehow. That's probably also why the interaction with epoll_pr_idle exists.

Trying with sched yield instead of printf/slapi_log_err():

GCC w_ log line: ldclt[15225]: Global average rate: 9105453.00/thr  (364218.12/sec), total: 36421811
GCC w_ printf + expect: ldclt[13949]: Global average rate: 9437884.00/thr  (377515.38/sec), total: 37751534
GCC w_ yield + expect: ldclt[28141]: Global average rate: 9817390.00/thr  (392695.59/sec), total: 39269562

There we go! Now we are seeing some gains! The syscall to printf or lock() was likely allowing a yield point for the scheduler, so instead we can manually add one - and it's much cleaner :)

Now curiously, if we use clang with yield + expect it actually lowers performance. Again, likely just related to clang emitting better code.

So ultimately the changes needed to get GCC to clang's performance are:

diff --git a/ldap/servers/slapd/daemon.c b/ldap/servers/slapd/daemon.c
index 6f5d1f6ca..eb9f6f471 100644
--- a/ldap/servers/slapd/daemon.c
+++ b/ldap/servers/slapd/daemon.c
@@ -55,6 +55,7 @@
 #include <private/pprio.h>
 #include <ssl.h>
 #include "fe.h"
+#include <sched.h>

 #if defined(LDAP_IOCP)
 #define SLAPD_WAKEUP_TIMER 250
@@ -1156,12 +1157,22 @@ slapd_daemon(daemon_ports_t *ports)
 #endif

     /* The meat of the operation is in a loop on a call to select */
-    while (!g_get_shutdown()) {
+    while (
+#ifdef __clang__
+        !g_get_shutdown()
+#else
+        __builtin_expect(!g_get_shutdown(), !0)
+#endif
+    ) {
 #ifdef ENABLE_EPOLL

         // XXX emitting any message between here and #else, provides 20x performance improvement
         // XXX except as noted below near epoll_pr_idle
-        slapi_log_err(SLAPI_LOG_ERR, "slapd_daemon", "epoll mystery\n");
+#ifndef __clang__
+        sched_yield();
+#endif

         // XXX There appears to be a slight performance improvment with if/else vs. case as firstyear predicted
         if ((num_poll = epoll_wait(the_connection_table->epollfd, ct_epoll_events, connection_table_size, slapd_wakeup_timer)) > 0) {

@llile (Author) commented Mar 17, 2022

I have tidied up the code, noting the performance anomaly with XXX markers. I do not comprehend this issue.

@tbordaz could you possibly analyze this version versus 2.x?

@tbordaz (Contributor) commented Mar 18, 2022

@llile, I am really sorry for this late feedback. I will rerun the tests on the latest version and describe the details of the test and what I am observing. My test environment got cleared and I had to recreate it.

@tbordaz (Contributor) commented Mar 21, 2022

The performance (throughput and response time) of the recent patch is similar to that of the first patch. With epoll, throughput drops by 40% and response time increases by 60%. Using the Ping searchrate tool.

389-ds-base-2.0
ldapbench/bin/searchrate --hostname server --port 389 --bindDN "cn=Directory manager" --bindPassword password --baseDN dc=example,dc=com --scope sub --filter "(uid=user[0:10000%00000])" --attribute dn --numThreads 50 -I 5
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
  150977.257        0.329        1.000        0.000   150977.218        0.329
  154457.337        0.323        1.000        0.000   152717.236        0.326
  153951.556        0.324        1.000        0.000   153128.693        0.325
  156661.714        0.318        1.000        0.000   154011.973        0.323
  158359.078        0.315        1.000        0.000   154881.280        0.322

389-ds-base-2.0 + epoll
ldapbench/bin/searchrate --hostname server --port 389 --bindDN "cn=Directory manager" --bindPassword password --baseDN dc=example,dc=com --scope sub --filter "(uid=user[0:10000%00000])" --attribute dn --numThreads 50 -I 5
      Recent       Recent       Recent       Recent      Overall      Overall
Searches/Sec   Avg Dur ms Entries/Srch   Errors/Sec Searches/Sec   Avg Dur ms
------------ ------------ ------------ ------------ ------------ ------------
   97315.923        0.512        1.000        0.000    97315.899        0.512
   97074.039        0.513        1.000        0.000    97194.966        0.513
   93545.599        0.534        1.000        0.000    95978.656        0.519
   94132.058        0.530        1.000        0.000    95516.981        0.522
   89023.958        0.560        1.000        0.000    94218.316        0.529

I installed a standalone instance with a very large dbcache and entry cache to store all DB entries and indexes. I will attach the database LDIF file.

@tbordaz (Contributor) commented Mar 21, 2022

users.ldif.final.gz

@tbordaz (Contributor) commented Aug 21, 2023

This PR is likely impacted by the multi-listener work, the removal of turbo mode, and dynamic thread management. The code has changed, but a new set of tests is also available.
We need to reevaluate this PR with those changes and see if epoll performance could be an alternative.

@Firstyear (Contributor)

100% agree, I think epoll should be looked at next and then the remaining bottlenecks found and cleaned up :)
