MLMG failed #2088

Closed
wang1202 opened this issue Jan 27, 2025 · 15 comments

@wang1202

Hello,

Thank you for resolving the previous grid refinement issue. Now I'm testing the case DevTests/FlowInABox, and I find that the run with a refined grid is very unstable. I can set amr.grid_eff = 0.1 or amr.n_error_buf = 5 to get the simulation to run for hundreds of steps, but this causes the entire domain to be refined, which is not what I want. Do you have any suggestions? Below is the output of the last 5 iterations with erf.mg_v = 4:

AT LEVEL 0 0   UP: Norm after  bottom 1.199623168e-08
MLMG: Iteration  95 Fine resid/bnorm = 2.211093305e-09
MLMG: Subtracting -4.394183651e-16 from mf component c = 0 on level (0, 0)
AT LEVEL 0 0   DN: Norm before bottom 1.199623049e-08
MLMG: Subtracting 1.71931457e-22 from mf component c = 0 on level (0, 0)
MLMG: Bottom solve failed.
AT LEVEL 0 0   UP: Norm after  bottom 1.190413938e-08
MLMG: Iteration  96 Fine resid/bnorm = 2.194119532e-09
MLMG: Subtracting -4.372072655e-16 from mf component c = 0 on level (0, 0)
AT LEVEL 0 0   DN: Norm before bottom 1.190413971e-08
MLMG: Subtracting -1.033134579e-22 from mf component c = 0 on level (0, 0)
MLMG: Bottom solve failed.
AT LEVEL 0 0   UP: Norm after  bottom 1.181302101e-08
MLMG: Iteration  97 Fine resid/bnorm = 2.177324811e-09
MLMG: Subtracting -4.396645998e-16 from mf component c = 0 on level (0, 0)
AT LEVEL 0 0   DN: Norm before bottom 1.181302037e-08
MLMG: Subtracting -1.870463104e-22 from mf component c = 0 on level (0, 0)
MLMG: Bottom solve failed.
AT LEVEL 0 0   UP: Norm after  bottom 1.172286616e-08
MLMG: Iteration  98 Fine resid/bnorm = 2.160707833e-09
MLMG: Subtracting -4.391924888e-16 from mf component c = 0 on level (0, 0)
AT LEVEL 0 0   DN: Norm before bottom 1.172286538e-08
MLMG: Subtracting -3.025547068e-22 from mf component c = 0 on level (0, 0)
MLMG: Bottom solve failed.
AT LEVEL 0 0   UP: Norm after  bottom 1.163366464e-08
MLMG: Iteration  99 Fine resid/bnorm = 2.144266838e-09
MLMG: Subtracting -4.381772983e-16 from mf component c = 0 on level (0, 0)
AT LEVEL 0 0   DN: Norm before bottom 1.163366518e-08
MLMG: Subtracting 4.585983918e-23 from mf component c = 0 on level (0, 0)
MLMG: Bottom solve failed.
AT LEVEL 0 0   UP: Norm after  bottom 1.154540596e-08
MLMG: Iteration 100 Fine resid/bnorm = 2.127999289e-09
MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 1.154540644e-08, 2.127999289e-09
amrex::Abort::0::MLMG failed. !!!

And below is the input I adjusted:

amr.n_cell       = 64  64  32

xlo.theta = 288.
xhi.theta = 288.
ylo.theta = 288.
yhi.theta = 288.
zlo.theta = 294.
zhi.theta = 282.

erf.cfl            = 0.8
erf.substepping_cfl = 0.5
erf.dt_max_initial = 0.1
erf.dt_max         = 0.1

# REFINEMENT / REGRIDDING
amr.max_level       = 1       # maximum level number allowed
amr.max_grid_size   = 256

erf.regrid_int = 2
erf.coupling_type = "TwoWay"

erf.refinement_indicators = diff_theta

erf.diff_theta.max_level     = 1
erf.diff_theta.field_name    = theta
erf.diff_theta.adjacent_difference_greater    = 1.0 2.0
erf.diff_theta.start_time = 1. 2.

amr.n_error_buf  = 5 5
#amr.grid_eff     = 0.1

# PROBLEM PARAMETERS
prob.rho_0 = 1.0
prob.T_0          = 288.
prob.T_0_Pert_Mag = 0.1
prob.U_0_Pert_Mag = 0.01
prob.V_0_Pert_Mag = 0.01
prob.W_0_Pert_Mag = 0.01
@asalmgren
Collaborator

Yes -- one of the things I noticed was that the grids at level 1 weren't "ideal". Typically, when using multigrid, we want the level 1 grids to be sufficiently coarsenable for good multigrid performance, which basically means each box dimension should be m * 2^n, e.g. 32 = 2^5 or 48 = 3 * 2^4. You can control this by setting amr.blocking_factor -- but note that this will also make the individual grids larger than you might want if you start with a small domain.
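
For illustration only, the relevant inputs-file knobs look like this (the specific values are hypothetical, not a recommendation for this case):

amr.blocking_factor = 8    # box dimensions and edges align to multiples of 8, keeping the level coarsenable
amr.max_grid_size   = 32   # cap on box size; 32 = 2^5 is of the m * 2^n form mentioned above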

There are a lot of ways to control the size and shape of the grids, and it takes a while to figure out what the ideal configuration is. grid_eff, n_error_buf, and blocking_factor are some of the most useful knobs.

This page is a good reference for how the grids are created.

One way to experiment might be to restart at a later time, where there's a region you know you want refined, and vary the parameters from that point to see which grid coverage is most efficient and does what you want.

Happy to help more -- let me know if this is at all helpful.

@wang1202
Author

wang1202 commented Jan 27, 2025

Hi @asalmgren, thanks for the detailed information. I hadn't paid attention to blocking_factor before. Yes, I restarted the run after first using the coarse grid to reach a steady state. I have two quick questions:

  1. "the grids at level 1 were't ideal" -- do you mean amr.n_error_buf should be the multiple of 2?
  2. I've tested many combinations of grid_eff, n_error_buf, and blocking_factor, and I found that it only works when the entire domain is refined. I'm considering adjusting the solver tolerance and the iteration limit. What's the easiest way to modify them?

@asalmgren
Collaborator

No -- n_error_buf doesn't have anything to do with the eventual size/shape of the grids; it just says how much you "buffer" the features you care about when tagging cells for refinement. One way to think about it: if you know a feature is moving at one grid cell per time step but you don't want to re-grid every timestep, you would set n_error_buf to be roughly regrid_int. For example, if you set n_error_buf = 5, the feature could move for 5 timesteps before it reached the coarse/fine boundary, because you created the grids with that much extra space to start with. This is overly simplistic, but hopefully it gives the idea.
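
A minimal illustration of that rule of thumb (hypothetical values, not tuned for this case):

erf.regrid_int  = 5      # re-grid every 5 coarse steps
amr.n_error_buf = 5 5    # buffer tagged cells by roughly regrid_int cells on each level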

@asalmgren
Collaborator

If you start with a relatively small grid (e.g. 32x64) and require the fine grids to be "relatively large", it will typically fill a lot of the domain. I also noticed that when the problem starts there aren't any well-defined regions to refine. Can you share a picture of when the gridding is doing something you don't want it to do?

@wang1202
Author

> If you start with a relatively small grid (e.g. 32x64) and require the fine grids to be "relatively large", it will typically fill a lot of the domain. I also noticed that when the problem starts there aren't any well-defined regions to refine. Can you share a picture of when the gridding is doing something you don't want it to do?

Sure. Please see the two snapshots before and after the gridding. I just want the locations with a sharp temperature gradient to be refined, but MLMG only converges when the entire domain is refined.

[Image: snapshot before gridding]  [Image: snapshot after gridding]

@wang1202
Author

> No -- n_error_buf doesn't have anything to do with the eventual size/shape of the grids; it just says how much you "buffer" the features you care about when tagging cells for refinement. One way to think about it: if you know a feature is moving at one grid cell per time step but you don't want to re-grid every timestep, you would set n_error_buf to be roughly regrid_int. For example, if you set n_error_buf = 5, the feature could move for 5 timesteps before it reached the coarse/fine boundary, because you created the grids with that much extra space to start with. This is overly simplistic, but hopefully it gives the idea.

Thanks for the clarification, but then I still don’t understand why level 1 isn't 'ideal.' Could you point out where the settings show that the level 1 grid is not ideal?

@asalmgren
Collaborator

There are tradeoffs between grid size and multigrid performance. Multigrid -- to work well -- needs to be able to coarsen. Imagine at level 1 you have a 4x4x4 grid and a 64x64x64 grid. The level will only be able to coarsen once, which means the "bottom solver" has to deal with a 2x2x2 grid (fine) and a 32x32x32 grid (expensive). So ideally we want the blocking factor to be at least 8 ... but when you try to decompose the level into boxes that are coarsenable by 8 and cover all your tagged points, it's a hard problem.
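
As a back-of-the-envelope illustration (a standalone sketch, not ERF or AMReX code), this counts how many times a box extent can be halved; a level can only coarsen as far as its least coarsenable box:

#include <iostream>

// How many times can a box extent n be halved before it is no longer evenly
// divisible (or is down to 2 cells)? Illustrative only; AMReX's actual logic
// also accounts for blocking_factor, box alignment, and the bottom-solver size.
int coarsenings (int n)
{
    int count = 0;
    while (n > 2 && n % 2 == 0) {
        n /= 2;
        ++count;
    }
    return count;
}

int main ()
{
    int sizes[] = {4, 8, 32, 48, 64};
    for (int n : sizes) {
        std::cout << n << " -> coarsenable " << coarsenings(n) << " time(s)\n";
    }
    // Prints: 4 -> 1, 8 -> 2, 32 -> 4, 48 -> 4, 64 -> 5
}

So a level containing a 4-wide box bottoms out after a single coarsening, even though its 64-wide neighbor could have gone much further on its own.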

@asalmgren
Collaborator

My suggestion is to cut n_error_buf down to 2 -- that's our usual default -- and set max_grid_size = 8. How does that work?

@wang1202
Author

Thanks for the details and suggestion. I've tested the following settings:

amr.max_grid_size   = 8
amr.n_error_buf     = 2
amr.grid_eff        = 0.6
amr.blocking_factor = 4

Result: MLMG: Failed to converge after 100 iterations.

Increase blocking_factor:

amr.max_grid_size   = 8
amr.n_error_buf     = 2
amr.grid_eff        = 0.6
amr.blocking_factor = 8

Result: Entire domain is refined.

Reduce grid_eff:

amr.max_grid_size   = 8
amr.n_error_buf     = 2
amr.grid_eff        = 0.5
amr.blocking_factor = 4

Result: Entire domain is refined.

@asalmgren
Collaborator

I think the code is doing exactly what you're telling it to do, subject to the constraints. Can you modify your tagging criteria so that only a very small region is refined, and verify that you can get a single grid that doesn't cover the whole domain?

Keep in mind, btw, that blocking_factor = 8 means not just that the grids have to be at least 8 cells wide but, more specifically, that the left edge will have to be at i = 0, 8, 16, 24, 32, 40, 48, or 56.
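
A small sketch of that alignment constraint (a hypothetical helper, not AMReX's actual grid-generation code): snapping a tagged index range outward to blocking_factor boundaries shows why even a narrow tagged feature produces a box that is at least blocking_factor wide and aligned to it.

#include <iostream>
#include <utility>

// Snap a tagged cell-index range [lo, hi] outward to blocking_factor
// boundaries. Hypothetical illustration only (assumes non-negative indices).
std::pair<int,int> align_to_blocking_factor (int lo, int hi, int bf)
{
    int lo_aligned = (lo / bf) * bf;             // round lo down to a multiple of bf
    int hi_aligned = ((hi + bf) / bf) * bf - 1;  // round hi+1 up to a multiple of bf
    return {lo_aligned, hi_aligned};
}

int main ()
{
    // A feature tagged only at i = 10..12 with blocking_factor = 8
    auto [lo, hi] = align_to_blocking_factor(10, 12, 8);
    std::cout << "box covers i = " << lo << " to " << hi << "\n";   // prints 8 to 15
}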

@wang1202
Author

If I increase erf.diff_theta.adjacent_difference_greater from 1.0 to 1.5, the following setting fails (it works with adjacent_difference_greater = 1.0):

amr.max_grid_size   = 8
amr.n_error_buf     = 2
amr.grid_eff        = 0.6
amr.blocking_factor = 8

I need to reduce the grid_eff by 0.1:

amr.max_grid_size   = 4
amr.n_error_buf     = 2
amr.grid_eff        = 0.5
amr.blocking_factor = 8

Then I can get an unrefined layer in the middle before the entire domain eventually gets refined.

[Image: snapshot showing the unrefined middle layer]

I removed my previous comment about the advection scheme, because I realized I had forgotten to change the input file name when testing different advection schemes. After further testing, the advection scheme does not appear to be the cause.

@asalmgren
Collaborator

One thing we can think about, if the smaller grids are important: if we implement an algorithm without subcycling in time between levels, then the Poisson solve would be over the whole grid hierarchy, which means the level 1 grids wouldn't need to be as coarsenable. But it would take a little time to make those changes.

@wang1202
Author

wang1202 commented Jan 29, 2025

Thanks for the information, @asalmgren! I have another question (not sure whether it is related to the dynamic regridding issue above). I found that it is hard to split the fine grid horizontally. In some simple tests of static mesh refinement, I tried to refine half of the domain: splitting the grid vertically works well, but splitting it horizontally always fails. Below is how I cut the grid.

The domain:

# PROBLEM SIZE & GEOMETRY
geometry.prob_lo = -1. -1.  0.
geometry.prob_hi =  1.  1.  1.
amr.n_cell       = 64  64  32

This works well:

erf.refinement_indicators =  box1
erf.box1.max_level     = 1
erf.box1.in_box_lo = -1. -1.  0.
erf.box1.in_box_hi = 1.  1.  0.5

This fails (output: SIGILL Invalid, privileged, or ill-formed instruction):

erf.refinement_indicators =  box1
erf.box1.max_level     = 1
erf.box1.in_box_lo = -1. -1.  0.
erf.box1.in_box_hi = 0.  1.  1.

This is the output of the last step before the failure; some values are as large as 1e+21 or 1e+22:

[Level 0 step 3188] Advanced 131072 cells
[Level 1 step 3] ADVANCE from time = 300.0991465 to 300.1277102 with dt = 0.02856364786
Making slow rhs at time 300.0991465 for fast variables advancing from 300.0991465 to 300.1277102
 No-substepping time integration at level 1 to 300.1277102 with dt = 0.02856364786
Max/L2 norm of divergence before solve at level 1 : 1.823740494e+21 1.485771711e+22
MLMG: Initial rhs               = 1.823740494e+21
MLMG: Initial residual (resid0) = 1.823740494e+21
MLMG: Final Iter. 8 resid, resid/bnorm = 3.246153728e+10, 1.779942782e-11
MLMG: Timers: Solve = 0.029332875 Iter = 0.02683175 Bottom = 0.000173165
Time in solve 0.031279
Max/L2 norm of divergence  after solve at level 1 : 3.246154138e+10 2.151640924e+12

Below are some other input settings that may be relevant:

fabarray.mfiter_tile_size = 1024 1024 1024

# TIME STEP CONTROL
erf.cfl            = 0.8
erf.substepping_cfl = 0.5
erf.dt_max_initial = 0.1
erf.dt_max         = 0.1

# REFINEMENT / REGRIDDING
amr.max_level       = 1       # maximum level number allowed
amr.ref_ratio = 2
amr.max_grid_size   = 256
amr.n_error_buf     = 0
amr.grid_eff        = 0.5
amr.blocking_factor = 8

erf.regrid_int = 2
erf.coupling_type = "TwoWay"

@asalmgren
Collaborator

@wang1202 -- Sorry for the delay! I believe this issue is now fixed in PR 2095 -- can you give it a try?

@wang1202
Author

wang1202 commented Feb 4, 2025

Hi @asalmgren, I think it works now. Thank you for the assistance! Being able to split the refinement horizontally is important for my case. Although I'm still testing options to get a more stable run, I can now see the box split horizontally, as shown here.

[Image: refined box split horizontally]
