ManualEnsemble class for specifying all comms in an Ensemble
#189
@JDBetteridge this PR only involves the |
JDBetteridge
left a comment
I may need to think about this some more...
    weakref.finalize(new_ensemble, split_global_comm.Free)
    weakref.finalize(new_ensemble, split_ensemble_comm.Free)
I think that the finalizer should be set in the __init__ method of the ManualEnsemble. It's the pattern used elsewhere and it prevents someone (user or developer) forgetting to add the finalizers.
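A minimal sketch of that pattern, runnable without MPI. `FakeComm` and `OwningEnsemble` are hypothetical stand-ins invented for illustration, not classes from this PR; the only real API used is `weakref.finalize`.

```python
import gc
import weakref


class FakeComm:
    """Hypothetical stand-in for an MPI communicator; records Free() calls."""

    def __init__(self, name):
        self.name = name
        self.freed = False

    def Free(self):
        self.freed = True


class OwningEnsemble:
    """Hypothetical class that registers its finalizer inside __init__."""

    def __init__(self, ensemble_comm):
        self.ensemble_comm = ensemble_comm
        # The callback is a bound method of the comm, not of self, so the
        # finalizer does not keep this object alive.
        weakref.finalize(self, ensemble_comm.Free)


comm = FakeComm("ensemble")
ensemble = OwningEnsemble(comm)
assert not comm.freed  # still alive, so nothing has been freed yet

del ensemble
gc.collect()  # not needed on CPython, but harmless on other interpreters
print(comm.freed)
```

Because the registration lives in `__init__`, every construction path gets the finalizer and no caller has to remember to add it.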
This comes back to the question of what ManualEnsemble should be responsible for. In this case the ensemble_comm and the global_comm need finalising but the spatial_comm doesn't.
Currently I've gone with "the user of ManualEnsemble is totally responsible for the comms they pass in", but we could have optional arguments to ManualEnsemble.__init__ for which comms to set finalizers for. I'd be OK with that.
Just to note, I'm saying "user" here but ManualEnsemble isn't exposed publicly. It's meant for internal use, so I'd expect it to always be wrapped in something like the split_ensemble function, which has more knowledge about comm lifetime.
e.g.

    import weakref

    class ManualEnsemble(Ensemble):
        def __init__(self, global_comm, spatial_comm, ensemble_comm,
                     finalize_global_comm=False, finalize_spatial_comm=False,
                     finalize_ensemble_comm=False):
            # Only free the comms that the caller asks this object to own.
            if finalize_global_comm:
                weakref.finalize(self, global_comm.Free)
            if finalize_spatial_comm:
                weakref.finalize(self, spatial_comm.Free)
            if finalize_ensemble_comm:
                weakref.finalize(self, ensemble_comm.Free)
            ...

        raise PyOP2CommError("spatial_comm must be subgroup of global_comm")
    if not is_subgroup(ensemble_group, global_group):
        raise PyOP2CommError("ensemble_comm must be subgroup of global_comm")
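To make the intent of these checks concrete, here is a pure-Python model in which a "group" is just a collection of world ranks. It is not the PR's implementation: `check_comm_set` is a hypothetical name, `is_subgroup` here is a set-based stand-in for the helper in the diff, `ValueError` replaces `PyOP2CommError`, and the final size check is an assumption about what "looks like a Cartesian product" could mean.

```python
def is_subgroup(group, parent):
    """True if every rank in `group` also appears in `parent`."""
    return set(group) <= set(parent)


def check_comm_set(global_group, spatial_group, ensemble_group):
    """Sanity-check the three groups as seen from a single rank."""
    if not is_subgroup(spatial_group, global_group):
        raise ValueError("spatial_comm must be subgroup of global_comm")
    if not is_subgroup(ensemble_group, global_group):
        raise ValueError("ensemble_comm must be subgroup of global_comm")
    # Assumption of this sketch (not shown in the diff): in a Cartesian
    # product of comms the sizes multiply up to the global size.
    if len(spatial_group) * len(ensemble_group) != len(global_group):
        raise ValueError("comm sizes do not form a Cartesian product")
    return True


# Rank 5 of 8, with spatial comms of size 2 and ensemble comms of size 4.
ok = check_comm_set(range(8), [4, 5], [1, 3, 5, 7])
print(ok)
```

A check like this is local to each rank, so it cannot detect ranks that disagree about which communicator they passed in.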
This logic isn't completely exhaustive: it doesn't currently check whether all ranks are using the same communicator. For example:

    from mpi4py.MPI import COMM_WORLD
    # ManualEnsemble is the class added in this PR (import omitted).

    if __name__ == "__main__":
        r = COMM_WORLD.rank
        s = COMM_WORLD.size
        # Two separate Split calls that produce the same rank grouping.
        ensemble_color1 = int(r < s/2)
        ensemble1 = COMM_WORLD.Split(color=ensemble_color1, key=r)
        ensemble_color2 = int(r >= s/2)
        ensemble2 = COMM_WORLD.Split(color=ensemble_color2, key=r)
        # Integer division: Split needs an integer color.
        spatial_color = r % (s//2)
        spatial = COMM_WORLD.Split(color=spatial_color, key=r)
        correct = ManualEnsemble(COMM_WORLD, spatial, ensemble1)
        # Each half of the ranks passes an ensemble comm from a different Split.
        if r < s/2:
            broken = ManualEnsemble(COMM_WORLD, spatial, ensemble1)
        else:
            broken = ManualEnsemble(COMM_WORLD, spatial, ensemble2)
        print("ALL PASSED")

This will run just fine, but the broken ensemble uses two different communicators.
This is broken in a very subtle way, as the mismatched comm will be destroyed when the ensemble is destroyed.
I don't see what is broken about this.
Assuming 8 ranks, ranks 0-3 use the ensemble comm from the first Split, and ranks 4-7 use the ensemble comm from the second Split call, but this doesn't matter. All ranks in each ensemble comm use the same one, which is what matters.
They don't "know", and don't need to know, what the other half is doing so long as every rank has an ensemble comm that connects the same part of all spatial comms.
As written so far, ManualEnsemble doesn't destroy any of the comms it's given (the docstring explicitly says that the caller is responsible for this).
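The 8-rank argument can be checked with a small pure-Python model of the Split calls in the example. The `split` helper is hypothetical and only tracks which ranks end up grouped together; it says nothing about communicator identity or lifetime, which is the part still under discussion.

```python
from collections import defaultdict


def split(ranks, color_of):
    """Model of a comm split: ranks with equal color share a group."""
    groups = defaultdict(list)
    for r in ranks:
        groups[color_of(r)].append(r)
    # Map every rank to the membership of the group it lands in.
    return {r: tuple(groups[color_of(r)]) for r in ranks}


size = 8
half = size // 2
ranks = list(range(size))

ensemble1 = split(ranks, lambda r: int(r < half))
ensemble2 = split(ranks, lambda r: int(r >= half))
spatial = split(ranks, lambda r: r % half)

# The "broken" choice: each half takes its comm from a different Split call.
broken = {r: ensemble1[r] if r < half else ensemble2[r] for r in ranks}

for r in ranks:
    # Every member of this rank's ensemble comm sees the same membership.
    assert all(broken[m] == broken[r] for m in broken[r])
    # The ensemble comm and the spatial comm share only this rank.
    assert set(broken[r]) & set(spatial[r]) == {r}

print(broken[0], broken[4], spatial[0])
```

In this model both Split calls produce the same two groups, so ranks 0-3 and ranks 4-7 each agree among themselves on who is in their ensemble comm.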
This PR slightly changes the way we split a large Ensemble into multiple smaller Ensembles, either in the SliceJacobiPC or in the nonlinear Gauss-Seidel iterations.

firedrake.Ensemble essentially has two responsibilities:

1. Setting up a global_comm, a spatial_comm, and an ensemble_comm. Currently, firedrake.Ensemble does this by taking a global_comm and splitting it into a Cartesian product of comms, providing a spatial_comm and an ensemble_comm from this product.
2. Wrapping mpi4py calls so that we can send firedrake.Function objects from one spatial_comm to another across the ensemble_comm with a simple API and some sanity checks.

Creating a split ensemble involves intercepting the logic in 1 to make sure the comms in the split ensemble relate properly to the comms in the larger ensemble. Specifically, the split ensemble needs three communicators:

- a global_comm split from the global_comm of the larger ensemble
- an ensemble_comm split from the global_comm or ensemble_comm of the larger ensemble
- a spatial_comm that is the same as the spatial_comm from the larger ensemble, so we can use the same mesh with both Ensembles.

The main issue here is that we need the spatial_comm of the split ensemble to be the same comm as the spatial_comm of the large ensemble, not just congruent. This means that we can't just make the smaller global_comm for the split ensemble and reuse the existing firedrake.Ensemble.__init__.

Previously I made a new EnsembleConnector class (terrible name: it connects existing spatial_comms, it doesn't connect different Ensembles). This class inherited from firedrake.Ensemble but overrode __init__, taking a global_comm and a specific spatial_comm, then created a new ensemble_comm by splitting the provided global_comm. To go with this is a split_ensemble function that takes in a large ensemble, splits its global_comm, and passes the split global_comm and the spatial_comm to the new EnsembleConnector.

This works fine for our case but has a couple of issues (other than the naming issues that already plague Ensemble).

- The split_ensemble function does some, but not all, of task 1, making sure we have three viable comms. It sorts out the global_comm and spatial_comm but not the ensemble_comm, which is left to the EnsembleConnector.
- EnsembleConnector takes the spatial_comm but not the ensemble_comm (what about the case where I already have the ensemble_comm but not the spatial_comm, or already have both?).

This PR changes to having a ManualEnsemble class that inherits from firedrake.Ensemble but just takes three comms and checks that they look like a global/spatial/ensemble comm set (i.e. they look like a Cartesian product of comms). It essentially is only doing task 2, wrapping mpi4py calls, and trusts that the three provided comms are a valid set to use.

The split_ensemble function now does all of task 1, taking in a larger Ensemble, splitting the global_comm and the ensemble_comm, and passing these plus the original spatial_comm to ManualEnsemble.

ManualEnsemble is simpler and more general than EnsembleConnector was (also more of a footgun, but I've tried to add enough checks), and split_ensemble now takes care of all of the logic of splitting an Ensemble, rather than just some of it.
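As a rough illustration of the rank bookkeeping involved, here is a pure-Python model of splitting an 8-rank ensemble into two halves. Everything in it is an assumption made for illustration: the rank layout (spatial comms as contiguous blocks of `spatial_size` ranks), the names `group_by`, `split_color` and `members_per_split`, and the choice of color. It is not the code in this PR.

```python
def group_by(ranks, key):
    """Model of a comm split: ranks with equal key share a group."""
    groups = {}
    for r in ranks:
        groups.setdefault(key(r), []).append(r)
    return {r: tuple(groups[key(r)]) for r in ranks}


size, spatial_size, members_per_split = 8, 2, 2
ranks = list(range(size))

# Assumed layout of the large ensemble: contiguous spatial comms.
spatial = group_by(ranks, lambda r: r // spatial_size)
ensemble = group_by(ranks, lambda r: r % spatial_size)


def split_color(r):
    """Index of the smaller ensemble that rank r belongs to."""
    ensemble_rank = r // spatial_size
    return ensemble_rank // members_per_split


# Split the global comm and the ensemble comm with the same color,
# and reuse the spatial comm unchanged.
new_global = group_by(ranks, split_color)
new_ensemble = group_by(ranks, lambda r: (r % spatial_size, split_color(r)))

for r in ranks:
    # The new global comm is exactly the union of the (unchanged) spatial
    # comms of the ranks in the new ensemble comm.
    covered = sorted(m for e in new_ensemble[r] for m in spatial[e])
    assert tuple(covered) == new_global[r]

print(new_global[0], new_ensemble[0], spatial[0])
```

The loop at the end is the property the split has to preserve: the three groups seen by each rank still form a product, with the spatial group untouched.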