-
Notifications
You must be signed in to change notification settings - Fork 31
Cachepath work broken out into functional commits #108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Conversation
src/client/client/intercept_exec.c
Outdated
| debug_printf2("Propogating spindle environment by copying it to new envp list\n"); | ||
| for (cur = (char **) envp; *cur; cur++, orig_size++); | ||
| new_size = orig_size + 10; | ||
| new_size = orig_size + 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be 'new_size = orig_size + 9'. Eight environment variables get propogated, plus one slot for the NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| cachepath = chosen_realized_cachepath; | ||
| chosen_parsed_cachepath = chosen_parsed_cachepath; | ||
| chosen_symbolic_cachepath = chosen_symbolic_cachepath; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the set_intercept_readlink_cachepath() and set_should_intercept_cachepath(). Seems like we're keeping multiple copies of the paths in different static variables, then calling each of these functions to set the different copies.
Different variables holding the same copies of information are a source of bugs. Let's consolidate the variables holding paths in the clients to just one instance.
Or am I missing something where these are different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two issues here:
-
We're using setter functions to pass around pointers to static file-scoped variables. Making the original variable global gets rid of the copies, but creates its own source of bugs.
-
We're keeping track of the realized, parsed, and symbolic versions of the cachepaths, Parsed and symbolic aren't used in
set_inercept_readlink_cachepath()and symbolic isn't used inset_should_intercept_cachepath(). (The code that references them in the setter function is there to silence the compiler warning.)
Let's err on the side of deleting more code. Removed the two set_*() functions.
| int exit_readys_recvd; | ||
| ldcs_dist_model_t dist_model; | ||
| ldcs_client_t* client_table; | ||
| char *location; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've still got 'location'. What's does 'location' represent anymore? Is it a duplicate variable for cachepath?
Also, please comment what these variables contain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior of location hasn't changed, except that it's not longer used for cachepaths (it only affects the path used for fifo, the daemon, etc.).
We had discussed removing it entirely. Let's leave that for a separate PR.
I've added comments.
| char *candidate_cachepaths; /* Colon-separated list of candidate paths (max 64) */ | ||
| char *chosen_cachepath; /* The consensus path (same across all nodes). */ | ||
| uint64_t cachepath_bitidx; /* Bit index used by allReduce() to arrive at consensus. */ | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spindle_launch is the user interface function. We don't need to expose chosen_cachepath or cachepath_bitidx to the user. They're not selecting or reading those values.
Also, what's the difference between 'location' and 'candidate_cachepaths' here? What's it mean if a user sets both or only one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, location is used for paths to the fifo directory, the daemon file, and anything else that's not a cachepath. I expect we'll remove it in a following PR.
Removing chosen_cachepath and cachepath_bitidx resulted in a modest string of perturbations, ending with relocating "parseloc.h" to /src/util.
| } | ||
|
|
||
| COMM_LOCK; | ||
| client_recv_msg_static(fd, &message, LDCS_READ_BLOCK); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of messy doing one send and three recvs. In other parts of spindle the recv would be one message with three strings rather than three messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified to use a single reply. Removed future-proofing (not using the symbolic version of cachepaths).
| msgbundle_force_flush(procdata); | ||
| } | ||
|
|
||
| ldcs_audit_server_md_consensus(procdata, msg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest renaming ldcs_audit_server_md_consensus(...) to something like ldcs_audit_server_md_allreduce(MD_AND).
The ldcs_audit_server_md_* are all network operation focused, not spindle algorithm focused. That's the layer where you'd add a different network implementation (like infiniband), and we want to keep higher-level spindle concepts out of that layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed/redefined to ldcs_audit_server_md_allreduce_AND( uint64_t *val ).
I wanted to use your MD_AND suggestion, but that's going to result in a duplicated op enum in md-space and coho-space. As we're only using the bitwise-and op in md-space, let's kick that problem down the road and put the op name in the function name. Ugly, but effective.
| ldcs_send_msg(connid, &msg); | ||
| procdata->server_stat.clientmsg.cnt++; | ||
| procdata->server_stat.clientmsg.time += ldcs_get_time() - client->query_arrival_time; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The repeated adds to clientmsg.cnt and clientmsg.time is triple counting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved in earlier commit when we went to a single-message response.
| "reloc-python", &relocpython, | ||
| "python-prefix", &pyprefix, | ||
| "location", &location, | ||
| "numa", &numa, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to previous, what's the difference between location and cachepaths now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"location" will be renamed to "commpath" in a subsequent PR.
src/fe/startup/config_mgr.h
Outdated
| shortExecExcludes = 298, | ||
| shortPatchLdso | ||
| shortPatchLdso, | ||
| shortCachePaths, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to keep the pattern going, could you add numbers to the enum values (and fix shortPatchLdso while at it). I know it's not necessary, but it makes it easier to look values from debug_printfs if they're explicit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/cobo/cobo.c
Outdated
| /* not the root, so forward our reduction result to our parent */ | ||
| if (cobo_write_fd(cobo_parent_fd, pval, sizeof(*pval)) < 0) { | ||
| err_printf("Sending reduced data to parent failed\n"); | ||
| exit(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't exit on network failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed 27 instances of exiting on network failure to "return -1" and two to "return NULL".
Not confident that's a net gain. If we end up pulling the network code out into its own library, we can revisit error handling then.
Previously, chosen_realized_cachepath was copied into set_intercept_readlink_cachepath() chosen_realized_cachepath and chosen_parsed_cachepath were copied into set_should_intercept_cachepath() This PR removes both setter functions and makes the original pointers global.
Removes chosen_cachepath and cachepath_bitindex from spindle_launch.h Updates initialization of matching variables in ldcs_process_data. determineValidCachePaths() moved from spindle_be.cc to ldcs_audit_server_process.c to get ldcs_process_data visibility. Added #include "parseloc.h" to ldcs_audit_server_process.c to get declaration of determineValidCachePaths(). Relocated "parseloc.h" to src/util so ldcs_audit_server_process.c could find it. Trued up signedness of types caused my making "parseloc.h" more visible, e.g., cachepath_bitidx is now uint64_t everywhere.
The three-message-reply response is now a single message with two strings. The symbolic version of the cachepath is no longer communicated as it was not being used.
New name is ldcs_audit_server_md_allreduce_AND(). If we get to the point where we're using other allreduce operations we can solve the problem of duplicating the op list in md-land and cobo-land. For now, we're only using one op in md-land, so the op can go into the function name.
Fixes #61