-
Notifications
You must be signed in to change notification settings - Fork 130
Mixed-precision solver #6521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Mixed-precision solver #6521
Conversation
|
jenkins build this please |
|
i see mike drop. |
|
1.6 speedup is very promising @nrseman, great work!
|
I have renamed all |
|
Thank you for your interest and your questions, @multitalentloes. I don't want to start a detailed technical discussion in this PR request. I'll contact you separately. |
multitalentloes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution to OPM, I have some surface-level feedback on the code right now, and also taken the opportunity to note down some of the goals for this code further down that line.
Immediate code-fixes:
- Compilation error when compiling with CUDA/HIP (see comment)
- Compilation error type mismatch during variable assignment (see comment)
- Ensure that this PR does not break OPM compilation for ARM CPUs (as they do not support avx2)
General stuff:
- Use docstrings, see examples from other files in the codebase
- Use
clang-formatto make it more consistent with the rest of the codebase - Use
fmt::printinstead ofprintf
Future work:
- Registering the preconditioner in the factory (currently trivial placeholder). This will make it available to other multi-stage preconditioners in OPM, for instance letting us combine it with the default CPRW preconditioner, which is would have a great impact if performance is improved! From what we discussed outside of this PR this preconditioner is implemented with knowledge of how the BiCGSTAB is performed and that is something that should be looked more into.
- Having these preconditioners registered in the MPI factory as well with wrapBlockPreconditioner so it can be used in parallel simulations
| DUNE_UNUSED_PARAMETER(prm); | ||
| return std::make_shared<MultithreadDILU<M, V, V>>(op.getmat()); | ||
| }); | ||
| F::addCreator("mixed-ilu0", [](const O& op, const P& prm, const std::function<V()>&, std::size_t) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line produces compilation warnings due to unused variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by adding
DUNE_UNUSED_PARAMETER(op);and will be included in my next update.
| DUNE_UNUSED_PARAMETER(prm); | ||
| return std::make_shared<TrivialPreconditioner<V,V>>(); | ||
| }); | ||
| F::addCreator("mixed-dilu", [](const O& op, const P& prm, const std::function<V()>&, std::size_t) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused variable warnings here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed by adding
DUNE_UNUSED_PARAMETER(op);and will be included in my next update.
| virtual void update() override {}; | ||
| virtual bool hasPerfectUpdate() const override {return true;} | ||
| virtual void pre (X& x, Y& y) override {}; | ||
| virtual void post (X& x) override {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused variable warnings for all of these functions with arguments as well with the empty body. I think this has been resolved other places in the code with [[maybe_unused]] or a similar attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved by prepending all arguments by [[maybe_unused]] and will be part of my next update.
| { | ||
| int nrows = A.N(); | ||
| int nnz = A.nonzeroes(); | ||
| int b = A[0][0].N(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
M might be a GpuSparseMatrix when compiling on a machine with a GPU, so line will cause problems because using the [] operator is not available for that class. Probably easiest to do some type logic in the FlexibleSolver to avoid this. Causes compilation error out of the box for me.
| printf("n = %lu\n",x.dim()); | ||
| } | ||
|
|
||
| void exportSparsity(const char *path='.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
causes compilation error '.' is a char, replace with "." to make it a string/char*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, I don't get a compliation error, but your observation is correct and I have made the fix.
opm/simulators/linalg/mixed/prec.c
Outdated
|
|
||
|
|
||
| /* | ||
| void mat3_view(double const *M, char const *name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove dead code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| } | ||
|
|
||
| int bslv_pbicgstab3m(bslv_memory *mem, bsr_matrix *A, const double *b, double *x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general for new functions: Use docstring to document what the function does and what its expected inputs, outputs, and special assumptions are to make the implementation more accessible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@multitalentloes: I'm a bit confused about what you mean by docstring. Do you simply mean a multiiline comment or is it required to adhere to the doxygen standard? I see a mix in the current code. Also, is there a policy on whether the documentation is attached to the function definition of the function declaration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally it should adhere to the doxygen standard for those who use that. I would guess declaration is best, at least that is what I have done when declaration is in a separate headerfile
opm/simulators/linalg/mixed/prec.c
Outdated
|
|
||
|
|
||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use clang-format on all the files to ensure that things are compatible with existing code, easiest to just to it now from the start to avoid for instance weird whitespacing such as these blank lines
| { | ||
| public: | ||
|
|
||
| MixedSolver(M A, double tol, int maxiter, bool use_dilu) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit dubious to taking a copy of A here, it seems to work for some inexplicable reasons, but why not take a const ref (const M& A)? I think as it stands, the code is storing a pointer (data_) to a memory location that will be deleted after invocation. No idea why this works in practice here, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that this works implies that a copy of the matrix object A is a shallow copy. Nevertheless, your solution is better and the change will be included in my next update.
|
I have addressed all trivial issues and pushed the updates to the PR branch. What remains is the following:
|
|
@blattms: Would the file below work for the compilation test we discussed? I have verified that the code compiles with the following command gcc -mavx2 -c test_avx2.ctest_avx2.c#include <immintrin.h>
typedef
struct bsr_matrix
{
int nrows;
int ncols;
int nnz;
int b;
int *rowptr;
int *colidx;
double *dbl;
float *flt;
} bsr_matrix;
void bsr_vmspmv3(bsr_matrix *A, const double *x, double *y)
{
int nrows = A->nrows;
int *rowptr=A->rowptr;
int *colidx=A->colidx;
const float *data=A->flt;
const int b=3;
__m256d mm_zeros =_mm256_setzero_pd();
for(int i=0;i<nrows;i++)
{
__m256d vA[3];
for(int k=0;k<3;k++) vA[k] = mm_zeros;
for(int k=rowptr[i];k<rowptr[i+1];k++)
{
const float *AA=data+9*k;
int j = colidx[k];
__m256d vx = _mm256_loadu_pd(x+b*j);
vA[0] += _mm256_cvtps_pd(_mm_loadu_ps(AA+0))*_mm256_permute4x64_pd(vx,0b00000000); //0b01010101
vA[1] += _mm256_cvtps_pd(_mm_loadu_ps(AA+3))*_mm256_permute4x64_pd(vx,0b01010101); //0b01010101
vA[2] += _mm256_cvtps_pd(_mm_loadu_ps(AA+6))*_mm256_permute4x64_pd(vx,0b10101010); //0b01010101
}
// sum over columns
__m256d vy, vz;
vz = vA[0] + vA[1] + vA[2];
double *y_i = y+b*i;
vy = _mm256_loadu_pd(y_i); // optional blend to keep
vz =_mm256_blend_pd(vy,vz,0x7); // 4th element unchanged
_mm256_storeu_pd(y_i,vz);
}
} |
|
@nrseman I think the only thing needed for this to compile on with GPU support enabled is the change in |
|
@kjetilly: After updating my toolchain I ran into the |
Please add a main method to make this compile. We need to be able to run this program. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be some kind of tests that makes sure that the solver runs at least. Currently, we will only know whether it compiles.
If you do C allocations, then you should probably check the results. I am not sure whether the memory is sufficient for arrays that are later realigned.
Is there are any code that checks that:
- the blocks have the correct size
- that the run is serial and not parallel
- etc.
We need to prevent users from misusing this.
I might have missed something of course.
opm/simulators/linalg/mixed/bslv.c
Outdated
| @@ -0,0 +1,260 @@ | |||
| #define _POSIX_C_SOURCE 200809L // required for posix_memalgin in <stdlib.h> | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance we can get rid of this in favor of a compiler argument or requesting a certain C standard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reasonwith my toolchain adding -std=c11 does not fix this, but -D_ISOC11_SOURCE does. I removed the define statement from the code, but the macro definition needs to be added to by cmake. Right now I do that manually though -DCMAKE_C_FLAGS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we do not use posix_memalign anymore. Hence we don't need it. Seems resolved, right?
|
|
||
| prec_t *prec_alloc() | ||
| { | ||
| prec_t *P = malloc(sizeof(prec_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A check seems missing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now checking with assert.
|
|
||
| bsr_matrix* bsr_alloc() | ||
| { | ||
| bsr_matrix *A=malloc(sizeof(bsr_matrix)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result of malloc should probably be checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now checking with assert.
| // offsets for off-diagonal updates | ||
| int count; | ||
| count = prec_analyze(U,NULL); | ||
| P->offsets = malloc(3*(count+1)*sizeof(int)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check missing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now checking with assert.
|
|
||
| bslv_memory *bslv_alloc() | ||
| { | ||
| bslv_memory *mem = malloc(sizeof(bslv_memory)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now checking with assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good enough for OPM, because we never compile with -DNDEBUG to not make our simulator to fast 😅 One day that might change, though.
For "normal" projects I would use a check that even triggers in production. Asserts should be used to catch programming errors. This seems to be something that happens if system resources are exhausted.
| { | ||
| int nrows; | ||
| int ncols; | ||
| int nnz; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does using int for nnz limit us too much?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we start running models with more than 2 billion connections (~25M cells) then yes.
Let us just not lose sight of the fact that we are trying to get a fundamentally new feature into the simulator. Given that the current implementation only allows sequential runs, it is very unlikey that anyone will try to apply it to extrelemy large models.
|
Unfortunately, you will need to rebase this because of conflicts, |
|
This module might get used by others. Please make sure that the new headers are installed by listing them in CMakeLists_files.cmake variable PUBLIC_HEADER_FILES |
|
I am also getting some warnings: I have not checked this and it might be false postives on architecture ppc64el. This is for a rebased version. |
|
It seems like I can call the simulator without --matrix-add-well-contributions=true and do not get an error. What does happen in this case? Maybe we should enforce that option somehow? |
|
|
||
| // mixed-precision ILU0 | ||
| if (conf == "mixed-dilu") { | ||
| return setupMixedDILU(conf, p); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message below should probably be adapted.
Also we need to skip this or fallback to something different once my PR is merged and we do not have avx2 .
These two items have been addressd in my recent push by throwing if either of the above conditions are met. |
@blattms:I am not sure what you mean here. The only padding occuring is in the initializion of |
@blattms: As mentioned above, all allocations are now followed by asserts to ensure that they were successful. |
|
@multitalentloes: You have been quiet for a very long time. Do you mind summarizing your remaining must have requirements for merging in a check list similar to what @blattms provided above? This PR has been open for almost 2 months. It is time to get this done ... |
|
I came across another compiler warning: Let's see whether jenkins agrees. |
the comment was for a previous version where you used posix_memalign to change alignment for A->flt |
|
jenkins build this serial rocm hipify please |
|
Compilation with jenkins fails because of forbidden conversions. I'll add a task to my list above. |
|
@blattms: the link to jenkins ci does not work for me |
@blattms: I pushed another update that addresses the items above. The changes use For completeness, I include my full cmake invokation here: cmake -D CMAKE_CXX_COMPILER="g++" \
-D CMAKE_C_COMPILER="gcc" \
-D CMAKE_C_FLAGS="-O3 -std=c11 -D_ISOC11_SOURCE -march=native" \
-D CMAKE_VERBOSE_MAKEFILE=1 \
-D HAVE_AVX2_EXTENSION=TRUE \
..
|
|
@blattms: Note the |
For this we remove the GCC pragmas and set the needed compiler flag when available.
|
You are, that was a problem with the incorrect C standard. |
I cannot persuade CMake to use C11. Only C17 seems to work.
|
I now used OPM/opm-common#4840 and #6638 (which is based on this one with a few fixes and the same as haugenlabs#1) to test compilation. That fails, see https://ci.opm-project.org/job/opm-common-PR-builder/9096/console To compile error is for a case where the matrix stores float instead of double. |
|
Just pushed an update with @blattms changes to support auto detection of avx2 support. Configuration will only succeed when using a version of |
|
@blattms: I added a #if HAVE_AVX2_EXTENSION
} else if (solver_type == "mixed-bicgstab") {
if constexpr (Opm::is_gpu_operator_v<Operator>) {
OPM_THROW(std::invalid_argument, "mixed-bicgstab solver not supported for GPU operatorsg");
} else if constexpr (std::is_same_v<typename VectorType::field_type, float>){
OPM_THROW(std::invalid_argument, "mixed-bicgstab solver not supported for single precision.");
} else {
const std::string prec_type = prm.get<std::string>("preconditioner.type", "error");
bool use_mixed_dilu= (prec_type=="mixed-dilu");
using MatrixType = decltype(linearoperator_for_solver_->getmat());
linsolver_ = std::make_shared<Dune::MixedSolver<VectorType,MatrixType>>(
linearoperator_for_solver_->getmat(),
tol,
maxiter,
use_mixed_dilu
);
#endifThe code does the right thing when I've pushed my latest version to a new branch |
|
This issue seems unrelated to my PR. The current compilation failures occur when |
Bump ... |
| @@ -0,0 +1,23 @@ | |||
| # Mixed-precision linear solvers | |||
| This folder contains mixed-precision building blocks for Krylov subspace methods | |||
| and a highly optimized mixed-precision implementation of ILU0 preconditioned bicgstab. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DILU as well?
|
|
||
| // Solver memory struct | ||
| typedef | ||
| struct bslv_memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document this struct too
| _mm256_storeu_pd(xi,vz); | ||
|
|
||
| //double z[4]; | ||
| //_mm256_store_pd(z,vz); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some lines of dead code that look like they can be removed
| # Mixed-precision linear solvers | ||
| This folder contains mixed-precision building blocks for Krylov subspace methods | ||
| and a highly optimized mixed-precision implementation of ILU0 preconditioned bicgstab. | ||
| Hopefully, this will inspire the exploration of mixed-precision algorithms in OPM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some more in this readme about what makes it mixed-precision, what is computed and what is stored in what precision. It would make it easier to put in the context of the existing mixed-precision work in OPM and their corresponding publications.
| for(int k=0;k<9;k++) C[k]=M[k]; | ||
| } | ||
|
|
||
| void mat3_matfms(double *C, const double *A, const double *B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this and other matrix helper functions are not documented
Maybe also extract utility functions as this one to a separate file as it is not really specific to the precs themselves?
|
Sorry for the delay. Here is an updated list of things that should be in place before merging:
With these resolved then we can get this preliminary version merged that should be expanded upon in the way discussed earlier in this PR. |
This PR introduces mixed-precision building blocks for Krylov subspace methods and a highly optimized mixed-precision implementation of ILU0 + BiCGSTAB. Initial testing indicate 1.6x speed-up per linear iteration with minimal impact on iteration count. For information on how to activate the new solver see the README.md file in the
opm/simulators/linalg/mixedfolder.Looking forward to your feedback.
PS! After rebasing on master yesterday, the SPE10 case I used as one of my performance tests failed endpoint validation due to negative SOGCR. I have not investigated further, but it seems to be a finite precision issue. You can see my hack to work around the issue in the
mixed-devbranch on my fork