Unify model naming convention #123
base: main
Conversation
Good start. But I don't think we should use model aliases in such a manner. It is very confusing that MACE-mp-0-small maps to "small". It is completely unintuitive. An "alias" is a shortened name for a commonly used model (and you should not have too many of these since commonly used by definition means a few). This is what I propose:
I would even argue that all names need to be processed in a case-insensitive manner. @Andrew-S-Rosen Welcome your views too.
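The case-insensitive matching suggested here could look like the following minimal sketch. The alias entries and helper name are illustrative assumptions, not the actual matcalc tables:

```python
# A minimal sketch of case-insensitive alias resolution; the alias
# entries below are illustrative, not the actual matcalc tables.
ALIAS_TO_ID = {
    "MACE": "MACE-MP-PBE-0-M",
    "TensorNet-PBE": "TensorNet-MatPES-PBE-v2025.1-PES",
}
# Build a lowercase index once so user input can match in any case.
_LOWER_INDEX = {k.lower(): v for k, v in ALIAS_TO_ID.items()}

def resolve_alias(name: str) -> str:
    """Resolve a user-supplied name to a canonical ID, ignoring case."""
    try:
        return _LOWER_INDEX[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown model name: {name!r}") from None
```

This keeps the canonical capitalization in the table itself while accepting any capitalization from the user.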
Thanks for the ping and for tackling this, @rul048. While conceptually "simple", I think #120 is incredibly important to address and am glad you are working on this. I agree with your comments, @shyuep.
I do not have much of an opinion so long as it is internally consistent and captures all the necessary nuance.
Presumably, if a TensorNet-MatPES-PBE-v2026.1 were to come out next month, then TensorNet-PBE would be remapped to this newer version in a future release of matgl. On one hand, it makes sense to return the "best" version for the user. Of course, the downside of this is that when the alias is remapped, there will be breaking changes if the user upgrades without reading a CHANGELOG. I don't see a way around that. You get what you ask for with the alias.
Yup.
Change names from list comprehension to set comprehension. Signed-off-by: Runze Liu <146490083+rul048@users.noreply.github.com>
Compared to the previous version, the latest commit splits model resolution into two explicit stages. It introduces two separate mappings: ID_TO_ALIAS (user-facing aliases -> canonical IDs) and ID_TO_NAME (canonical IDs -> backend-specific model names). This makes the canonical identifier the shared "source of truth" inside MatCalc, while aliases become purely a user-compatibility layer and backend names become purely an implementation detail. Along with this change, the PR description now formalizes a simpler canonical identifier format at the MatCalc level (Architecture-Dataset-Functional-Version-Size) rather than treating the canonical name as a variable-length string.
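The two-stage lookup described above can be sketched as follows. The table contents are example entries drawn from this PR, and the helper name is an assumption:

```python
# Sketch of the two-stage resolution: alias -> canonical ID -> backend name.
# Table entries are examples from this PR; the function name is assumed.
ID_TO_ALIAS = {  # stage 1: user-facing alias -> canonical ID (many-to-one)
    "tensornet-pbe": "TensorNet-MatPES-PBE-v2025.1-PES",
}
ID_TO_NAME = {  # stage 2: canonical ID -> backend-specific model name
    "TensorNet-MatPES-PBE-v2025.1-PES": "TensorNet-MatPES-PBE-v2025.1-PES",
    "MACE-MP-PBE-0-S": "small",
}

def resolve_backend_name(user_input: str) -> str:
    """Resolve a user name through both stages; unknown aliases pass through
    as canonical IDs, which must then exist in ID_TO_NAME."""
    canonical = ID_TO_ALIAS.get(user_input.lower(), user_input)
    return ID_TO_NAME[canonical]
```

With this split, changing which model an alias points to touches only stage 1, and renaming a backend model touches only stage 2.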
# Keys must be lowercase and represent canonical identifiers.
# Values are the actual model names passed to the backend libraries.
ID_TO_NAME = {
I wouldn't bother ensuring IDs are lowercase here. It is better to use the proper name capitalization. When checking, we can make it a case-insensitive check.
Agreed. I've kept the canonical capitalization for the IDs (e.g., TensorNet-MatPES-r2SCAN-v2025.1-S), and the resolution is now case-insensitive when matching user input.
src/matcalc/utils.py
Outdated
}

# Common aliases and abbreviations will load the most advanced or widely used model.
ID_TO_ALIAS = {
This is non-intuitive. It should be ALIAS_TO_ID. And it is a many-to-one mapping.
| "M3GNet-MatPES-r2SCAN-v2025.1-S": "M3GNet-MatPES-r2SCAN-v2025.1-PES", | ||
| "CHGNet-MatPES-PBE-v2025.2-M": "CHGNet-MatPES-PBE-2025.2.10-2.7M-PES", | ||
| "CHGNet-MatPES-r2SCAN-v2025.2-M": "CHGNet-MatPES-r2SCAN-2025.2.10-2.7M-PES", | ||
| "MACE-MP-PBE-0-S": "small", |
As I mentioned in my previous review, the mapping should not be from MACE-MP-PBE-0-S to small. What does small even mean? This is the argument to the MACE init method. In that case, you should parse "MACE-MP-PBE-0-S", note that the last letter is S, and use small as the input.
I know. I agree that it is much better if we can parse the naming pattern rather than mapping. The issue here is that the backend MACE library uses the model argument to load models, but the naming across different families is not consistent, so the same semantic "size" appears in different positions and formats.
For original MPtrj trained models: the model itself is directly [size], like small/medium/large;
For modified MPtrj trained models: the model is [size]-[version], like small-0b2;
For MPA or OMAT trained models, the model is [size]-[dataset]-[version], like small-omat-0;
For MATPES trained models, the model is [architecture]-[dataset]-[functional]-[version].
To parse these cases, we would first need to determine which model family a given canonical identifier belongs to, and then construct the model argument in whatever format the MACE backend expects for that specific family. I don't see a good way to parse all these types effectively; the implementation would be more complex than a direct mapping.
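To make the complexity concrete, here is a hedged sketch of what family-aware parsing might look like for the four patterns listed above. The canonical IDs, family-detection rules, and size codes here are illustrative assumptions, not the actual matcalc implementation:

```python
# Hypothetical family-aware construction of the MACE `model` argument.
# Canonical IDs, family rules, and size codes below are assumptions for
# illustration; real MACE model names may differ.
SIZE_MAP = {"S": "small", "M": "medium", "L": "large"}

def mace_backend_arg(cid: str) -> str:
    """Build the MACE backend model argument from a canonical ID of the
    form Architecture-Dataset-Functional-Version-Size."""
    arch, dataset, functional, version, size_code = cid.split("-")
    size = SIZE_MAP[size_code]
    if dataset == "MP" and version == "0":
        return size                                    # original MPtrj: "small"
    if dataset == "MP":
        return f"{size}-{version}"                     # revised MPtrj: "small-0b2"
    if dataset in ("MPA", "OMAT"):
        return f"{size}-{dataset.lower()}-{version}"   # e.g. "small-omat-0"
    return cid                                         # MATPES-style: full name
```

Even this sketch hard-codes family-detection heuristics (dataset and version checks) that would break if a new family appears, which is the point being made above.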
Even if you want to do mapping, the MACE-specific mappings can live within the MACE loading part of the code. We should not create chaos in the general ID-to-name mapping for the sake of one code. Anything that is specific to one model family should be isolated within that model family itself.
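The isolation being suggested could be sketched as follows; the table and function names are hypothetical, not the matcalc API:

```python
# Sketch of keeping MACE-specific name translation inside the MACE loader.
# The table and function names here are hypothetical placeholders.
MACE_ID_TO_NAME = {  # private to the MACE loader module
    "MACE-MP-PBE-0-S": "small",
    "MACE-MP-PBE-0-M": "medium",
}

def mace_model_arg(canonical_id: str) -> str:
    """Translate a canonical ID into the MACE backend's model argument.
    Other backends never see this table; unmapped IDs pass through."""
    return MACE_ID_TO_NAME.get(canonical_id, canonical_id)
```

The general resolver then only deals in canonical IDs, and each backend loader owns whatever quirks its library requires.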
Summary
This PR standardizes how MatCalc loads universal MLIPs across multiple backend libraries (MatGL, MACE, GRACE, etc.) as a solution to Issue #120.
Different backend libraries currently use inconsistent or non-canonical model names (e.g., "small-omat-0", "medium-mpa-0", "GRACE-1L-OMAT-medium-base", "TensorNet-MatPES-PBE-v2025.1-PES"), which leads to ambiguity, user confusion, and repeated conversion logic scattered across the codebase.
To fix this, the PR introduces a unified model naming convention. The canonical identifier format is:
Architecture-Dataset-Functional-Version-Size
Examples:
The alias conversion table (ID_TO_ALIAS) maps commonly used case-insensitive abbreviations or aliases (e.g., MACE-MP-0, MACE-MP-PBE-0, MACE-MP-0-M) entered by the user to the canonical identifiers used in MatCalc. The backend name conversion table (ID_TO_NAME) then maps these identifiers to the model names that each backend library actually uses, so that users can write:
Using an abbreviation will load the most advanced model in the model family.
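For example, several case-insensitive abbreviations can resolve to the same canonical identifier; the chosen target below is an assumption for demonstration only:

```python
# Demonstration that several user abbreviations resolve to one canonical ID.
# The mapping target here is an assumed example, not the shipped table.
ID_TO_ALIAS = {
    "mace-mp-0": "MACE-MP-PBE-0-M",
    "mace-mp-pbe-0": "MACE-MP-PBE-0-M",
    "mace-mp-0-m": "MACE-MP-PBE-0-M",
}

def to_canonical(name: str) -> str:
    """Map a user abbreviation to its canonical ID, case-insensitively;
    names that are not aliases pass through unchanged."""
    return ID_TO_ALIAS.get(name.lower(), name)
```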
Major changes:
Todos
If this is work in progress, what else needs to be done?
Checklist
ruff.
mypy.
duecredit @due.dcite decorators to reference relevant papers by DOI (example).
Tip: Install pre-commit hooks to auto-check types and linting before every commit: