cpu memory optimization rebased to main #3868
base: main
Conversation
b9b6aeb to 51f64f0
Make sure to add a link to the resource_management page in index.rst
.. code-block:: bash

    export TRIM_CPU_MEMORY=1
Let's prefix this with TORCHTRT. So TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM would be clearer, I think.
export TRIM_CPU_MEMORY=1

This reduces approximately **2×** of redundant model copies, limiting
total CPU memory usage to up to **3×** the model size.
3x
offload_module_to_cpu = False

This removes another **1×** model copy, reducing peak CPU memory
usage to about **2×** the model size.
2x
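For context, a minimal sketch of how the two knobs above might be combined at compile time. The toy model and input shapes are illustrative, and it is assumed here that `offload_module_to_cpu` is accepted directly by `torch_tensorrt.compile`, as the docs excerpt implies:

```python
import os

import torch
import torch_tensorrt

# Assumed: the trim flag is read from the environment at build time
# (the exact variable name is still under discussion in this review).
os.environ["TRIM_CPU_MEMORY"] = "1"

# Toy stand-in model; any eval-mode CUDA module works the same way.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval().cuda()
inputs = [torch.randn(8, 64).cuda()]

# With the trim flag set and offloading disabled, the docs excerpt above
# suggests peak CPU memory drops from roughly 3x to roughly 2x the model size.
trt_model = torch_tensorrt.compile(model, inputs=inputs, offload_module_to_cpu=False)
```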
GPU Memory
^^^^^^^^^^

By default, Torch-TensorRT may consume up to **2×** the model size in GPU memory.
2x
offload_module_to_cpu = True

This shifts one model copy from GPU to CPU memory.
As a result, peak GPU memory usage decreases to about **1×**
1x
This shifts one model copy from GPU to CPU memory.
As a result, peak GPU memory usage decreases to about **1×**
the model size, while CPU memory usage increases by roughly **1×**.
This is a bit confusing; can we say it increases to roughly **2x** the model size?
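And the flip side for GPU memory, again as a hedged sketch with a toy model; the only setting taken from the excerpt is `offload_module_to_cpu=True`:

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval().cuda()
inputs = [torch.randn(8, 64).cuda()]

# Offloading shifts one model copy from GPU to CPU during compilation:
# per the excerpt, peak GPU usage drops to about 1x the model size, while
# CPU usage rises to roughly 2x overall (the wording suggested in the comment above).
trt_model = torch_tensorrt.compile(model, inputs=inputs, offload_module_to_cpu=True)
```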
for attr in dir(module):
    if attr.startswith("_frozen_param"):
        delattr(module, attr)
release_memory()
Can we make this function name a little more specific?
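For reference, a sketch of what a more descriptively named helper could look like, also tying in the env-var gate suggested earlier; the name `trim_host_memory_after_build`, the env-var check, and the glibc-only fallback are all illustrative, not what the PR actually implements:

```python
import ctypes
import gc
import os


def trim_host_memory_after_build() -> None:
    """Illustrative, more specific replacement name for release_memory()."""
    # Drop Python-level references first so freed blocks reach the allocator.
    gc.collect()
    # Optionally ask glibc to return freed arenas to the OS.
    if os.environ.get("TORCHTRT_ENABLE_BUILDER_MALLOC_TRIM", "0") == "1":
        try:
            ctypes.CDLL("libc.so.6").malloc_trim(0)
        except (OSError, AttributeError):
            pass  # Non-glibc platforms (macOS, musl) simply skip the trim.
```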
@needs_refit  # type: ignore[misc]
- def _insert_engine_to_cache(self, hash_val: str, serialized_engine: bytes) -> None:
+ def _insert_engine_to_cache(self, hash_val: str, engine: trt.ICudaEngine) -> None:
@zewenli98 When do these calls run? Will this conflict with the goal of keeping memory usage under 3x?
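One way to read the signature change (a guess, not necessarily what the PR does): pass the live engine down and serialize it only at cache-insertion time, so the serialized copy is not held in host memory for the whole build. A rough sketch, where `engine_cache.insert` is a stand-in for whatever cache API is actually used:

```python
import tensorrt as trt


def _insert_engine_to_cache(self, hash_val: str, engine: trt.ICudaEngine) -> None:
    # Serialize lazily, right before the cache write; trt.IHostMemory supports
    # the buffer protocol, so bytes() yields a plain Python blob.
    serialized_engine = bytes(engine.serialize())
    self.engine_cache.insert(hash_val, serialized_engine)  # hypothetical cache API
```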
Should we do caching in a post-processing step?
For example, we could return the cache entry as one of the InterpreterResult fields.
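A rough sketch of that idea; the real TRTInterpreterResult has different fields, and `cache_entry` here is purely hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class InterpreterResultSketch:
    """Toy stand-in for the interpreter's result object, not the real class."""

    serialized_engine: bytes
    input_names: Tuple[str, ...]
    output_names: Tuple[str, ...]
    # Hypothetical field: the engine-cache payload is produced during conversion
    # but only written to the cache in a later post-processing step, so the
    # builder does not keep an extra serialized copy alive during compilation.
    cache_entry: Optional[bytes] = None
```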
        self.tag(subgraphs)
        return self.split()

    def calculate_num_of_break(self, subgraphs: List[Subgraph]) -> int:
calculate_num_breaks
Should this go in with the other malloc_trim changes, or in the graph break PR?
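For illustration, one plausible shape for that helper under the suggested name, assuming a break is counted at each boundary between an accelerator-supported subgraph and an unsupported one; `Subgraph` and its `is_acc` flag are stand-ins here, not the splitter's actual types:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subgraph:
    """Stand-in for the splitter's subgraph record."""

    is_acc: bool  # True if this subgraph is meant to run in TensorRT


def calculate_num_breaks(subgraphs: List[Subgraph]) -> int:
    # Each transition between a TRT-supported and an unsupported subgraph
    # corresponds to one graph break.
    return sum(
        1 for prev, curr in zip(subgraphs, subgraphs[1:]) if prev.is_acc != curr.is_acc
    )
```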
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes # (issue)
Type of change
Please delete options that are not relevant and/or add your own.
Checklist: