[BYO-FT] Improve check for available memory #210
Conversation
Previously, any parameters whose allocation was performed by the Relax VM would be double-counted.
```diff
@@ -232,7 +238,9 @@ def profile_memory_usage(self, seq_lens):

         self.mod["evaluate"](input_ids, positions, seq_lens, self.params)

-        return self.get_used_memory()
+        vm_alloc_after = self.get_used_memory()
```
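(For context, a rough sketch of the accounting this change moves toward. This is not the PR's actual code: `vm_allocated_param_bytes` is a hypothetical input standing in for however the caller determines which parameter allocations the VM itself performed.)

```python
def profile_memory_usage(mod, params, input_ids, positions, seq_lens,
                         get_used_memory, vm_allocated_param_bytes):
    """Sketch of the corrected accounting.  `vm_allocated_param_bytes` is a
    hypothetical precomputed count of parameter bytes that the Relax VM
    itself allocated (zero in the pre-processed-weights flow)."""
    mod["evaluate"](input_ids, positions, seq_lens, params)
    vm_alloc_after = get_used_memory()
    # Parameters allocated by the VM would otherwise appear twice: once as
    # weights, once in the VM's allocation statistics.
    return vm_alloc_after - vm_allocated_param_bytes
```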
Sorry, there is an incredibly embarrassing hack I did to TVM's `PooledAllocator`, based on the assumption that `get_used_memory()` is only called once:
https://github.com/octoml/tvm/blob/for-mlc-serve-jan12/src/runtime/memory/pooled_allocator.h#L114-L116

This was to change the behavior of `PooledAllocator` before / after memory profiling, to work around a limitation of the allocator. I have a branch that integrates a third-party CUDA allocator, https://github.com/rapidsai/rmm, to replace `PooledAllocator`, so that we don't have to worry about this issue.
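(Not TVM's actual C++ source — a minimal Python illustration of the one-shot behavior described above, assuming `used_memory()` doubles as the "profiling is over" signal:)

```python
class OneShotPooledAllocator:
    """Toy model of the described hack: the first call to used_memory()
    also flips the allocator out of profiling mode."""

    def __init__(self):
        self.profiling = True   # statistics are only tracked in this mode
        self.used_bytes = 0

    def alloc(self, nbytes):
        if self.profiling:
            self.used_bytes += nbytes

    def used_memory(self):
        used = self.used_bytes
        self.profiling = False  # side effect: ends profiling mode
        return used
```

Under this model, a second `used_memory()` call reads statistics that are no longer being maintained, which is why calling it twice silently breaks the check.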
That said, we can make `get_used_memory()` pure by introducing a global function to toggle the allocation mode. I should have done that, but I took the shortcut of abusing `get_used_memory()` to signal the end of memory profiling.

Such a global function would also be useful even after we land the allocator based on rmm, since tracking statistics for memory allocation is costly and should only be done during memory profiling.
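(A sketch of what that could look like from Python, assuming a registered global function — the name `vm.memory_manager.set_profiling_mode` is made up for illustration and does not exist in TVM:)

```python
import tvm

# Hypothetical registered function; nothing with this name exists upstream.
set_profiling_mode = tvm.get_global_func("vm.memory_manager.set_profiling_mode")

set_profiling_mode(True)    # start tracking allocation statistics
mod["evaluate"](input_ids, positions, seq_lens, params)
peak = get_used_memory()    # now a pure query, safe to call repeatedly
set_profiling_mode(False)   # stop paying the tracking overhead
```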
Ah, got it. So, by calling it twice, I'm invalidating the safety net anyway.

In that case, I'd lean toward disabling this check altogether and replacing it with compile-time tracking of memory usage. That way, instead of profiling, we could know the maximum size of the live set.
I'm curious whether such a compile-time estimate of memory usage would agree with the "ground truth" that we collect at runtime, given that the latter is affected by the behavior of the underlying allocator, which the memory planner is not aware of. And my impression of Relax's memory planning has been that it allocates much more than necessary, compared to the allocator in torch.
Yeah, I think that increased memory would show up in the compile-time estimation, because a large part of the footprint is due to operations that aren't done in place. @slyubomirsky has been doing a lot of good work on in-place operations, which should give us a path forward on reducing the footprint.
When do we encounter such params? I thought all params are allocated by […].

In the current flow with pre-processed weights, that is correct: all parameters are allocated by […]. For the pre-process-at-initialization flow, which will be used to support bring-your-own-fine-tune, the […].
Hi, @Lunderberg! Thanks for the contribution!

Does this mean there will be dynamic allocation? To guarantee that we don't get OOM at runtime, we've been very careful about memory allocation, assuming the worst-case scenario. I think we are reaching the point where we need to revisit this assumption, so your input will be much appreciated :)
Currently, all the allocations are static, but that's entirely due to the weights having static shapes. While we could in theory write a […], I think it would be good to have a PrimFunc that takes any dynamic parameters and returns the memory footprint for that set of dynamic parameters.
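(To make the idea concrete, a plain-Python stand-in for such a footprint function. The shape variables and the formula are purely illustrative assumptions, not mlc-serve's actual signature or a real model's footprint:)

```python
def evaluate_footprint_bytes(batch_size: int, seq_len: int,
                             hidden_size: int = 4096,
                             num_layers: int = 32,
                             dtype_bytes: int = 2) -> int:
    """Hypothetical stand-in for the proposed PrimFunc: given the dynamic
    shape variables, return the peak live-set size in bytes."""
    activations = batch_size * seq_len * hidden_size * dtype_bytes
    # Placeholder assumption: the planner keeps roughly two activation-sized
    # buffers live per layer (e.g. when ops are not done in place).
    return 2 * num_layers * activations
```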