Currently we are stalling on #256, which appears on the surface to be a bug fix, but in reality is a continuation of the replacement contributors implementation (#260), now that we are attempting to run the feature through integration testing. During the course of this work, some problems with the current software design were highlighted. I've held off on giving this kind of feedback for a while; everyone has something to complain about when working on a legacy project. But now that we are quite a few months into work that was supposed to take a few weeks, and we have a better understanding of the scope of the project, perhaps it is the right time to highlight some technical debt which may be impeding progress:
Side Effects and Global State
There are many unexpected side effects hidden in methods, making it difficult to reason about changes and about the causes of bugs. Prime among these is the global state of the storage layer, which can be modified at any point in the coordinator, and which is also modified by the web server hosting the coordinator. Modification of the task queues in different places is also a great source of complexity and potential problems.
What modification of state is required to drop a contributor? The answer to this question is spread across at least three files. The round state is handled separately from the storage layer, and in different places, which allows the two to easily get out of sync, causing confusing problems. A more functional approach, separating business-logic calculations from the execution of effectful actions, would make the code a lot easier to reason about. All modification of important state should be concentrated in one place, where it is easy to test and easier to reason about.
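To illustrate the functional-style split, here is a minimal sketch. All names (`StateChange`, `plan_drop`, `apply`) are hypothetical, not the actual coordinator API: a pure function computes what dropping a contributor means, and a single `apply` method is the only place that mutates state.

```rust
/// Hypothetical description of a single state mutation.
#[derive(Debug, Clone, PartialEq)]
enum StateChange {
    RemoveFromQueue { participant: String },
    ReassignTask { task_id: u64, to: String },
    MarkDropped { participant: String },
}

#[derive(Debug, Default)]
struct CoordinatorState {
    dropped: Vec<String>,
    // ... round state, task assignments, etc.
}

impl CoordinatorState {
    /// Pure: computes the changes required to drop a contributor
    /// without performing any of them. Trivial to unit test.
    fn plan_drop(&self, participant: &str) -> Vec<StateChange> {
        vec![
            StateChange::RemoveFromQueue { participant: participant.to_string() },
            StateChange::MarkDropped { participant: participant.to_string() },
        ]
    }

    /// The single place where state is actually mutated.
    fn apply(&mut self, change: &StateChange) {
        if let StateChange::MarkDropped { participant } = change {
            self.dropped.push(participant.clone());
        }
        // ... handle the other variants here, in one match, in one place.
    }
}
```

With this shape, "what does dropping a contributor modify?" is answered by one pure function, and the effects can be logged, tested, or replayed independently of the calculation.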
Multiple Task Queues
Currently we have multiple task queues per participant, and assigned tasks get moved between them at different points during the ceremony; it is possible for tasks to go missing or be duplicated between queues. Each queue in effect represents the state of a given task. This is a source of confusion and bugs while working on the coordinator_state.rs module.
We could perhaps instead store the tasks in a single master list, though we would still need to consider concurrent read/write access and iteration/cache-coherence performance implications. There is currently a maximum of about 40,000 tasks (universal setup: 4096 chunks * 5 participants * 2 (contribution + verification)). At the moment there is no concurrent read/write access, as the entire coordinator state is behind an RwLock. If traversing the entire task list became a problem, tasks could be maintained in per-participant buckets, perhaps using something like dashmap to allow independent mutability per bucket, while still eliminating the separate queues for task state.
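A sketch of the single-master-list idea, with illustrative names (`Task`, `TaskState`, `TaskList` are assumptions, not the existing types): each task carries its state explicitly, rather than encoding state by which queue the task currently sits in.

```rust
use std::collections::HashMap;

/// The task lifecycle, made explicit instead of being implied by queue membership.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Pending,
    Assigned,
    Contributed,
    Verified,
}

#[derive(Debug)]
struct Task {
    chunk_id: u64,
    participant: String,
    state: TaskState,
}

#[derive(Debug, Default)]
struct TaskList {
    /// Exactly one entry per task id: a task cannot be lost or
    /// duplicated by moving it between queues.
    tasks: HashMap<u64, Task>,
}

impl TaskList {
    /// A state transition is just a field update; returns false if unknown.
    fn transition(&mut self, task_id: u64, to: TaskState) -> bool {
        match self.tasks.get_mut(&task_id) {
            Some(task) => {
                task.state = to;
                true
            }
            None => false,
        }
    }

    /// The old "queues" become filtered views over the one list.
    fn in_state(&self, state: TaskState) -> impl Iterator<Item = &Task> {
        self.tasks.values().filter(move |t| t.state == state)
    }
}
```

At ~40,000 tasks a full scan is cheap, and the per-participant bucket variant would only change the key structure, not this basic shape.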
Chunk Locking
Currently, from my understanding, the external API for locking chunks does not appear to be inherently needed to meet the requirements of the coordinator. It appears to be an artefact of the implementation, and it introduces unnecessary complexity. If the coordinator decides which tasks contributors and verifiers should perform at any given moment, there should be no need to expose locking in the frontend. Internally there may be some use in locking chunks for verifiers, but there may be other simpler or more reliable ways to design this.
"Unsafe" Methods
There are a lot of concerning TODOs and methods labelled "unsafe" (not in the Rust sense), which might also be an indication of attempts to work around fundamental inadequacies of the design for what we are trying to do. My feeling is that we should not need to label methods as unsafe at all: we should structure the code so that they can only be called when it is sensible to do so (again, concentrating important changes to global state in a single place).
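One way to make "only callable when sensible" a compile-time property rather than a naming convention is to put mutation methods on a type that can only be obtained from the one valid code path. This is a hypothetical sketch; `RoundState` and `RoundTransition` are illustrative names, not the existing types.

```rust
struct RoundState {
    height: u64,
}

/// Handed out only by the coordinator's single state-update entry point,
/// so the mutation below cannot be reached from arbitrary call sites.
/// This replaces a hypothetical `update_round_height_unsafe` free-for-all.
struct RoundTransition<'a> {
    state: &'a mut RoundState,
}

impl<'a> RoundTransition<'a> {
    /// Consumes the transition, so it can only be performed once.
    fn advance(self) {
        self.state.height += 1;
    }
}

impl RoundState {
    /// The one place that grants permission to mutate round state.
    fn begin_transition(&mut self) -> RoundTransition<'_> {
        RoundTransition { state: self }
    }
}
```

The borrow checker then enforces what the "unsafe" label currently only documents: nobody can advance the round without going through `begin_transition`.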
Storage Layer
The storage layer appears to be built around an abstraction that would allow the use of a third-party storage backend other than a file system (some kind of file-upload service, or a database). This abstraction was never put to use; however, it is probably too core to the system to remove, and it might still serve its purpose in the future. The important thing to note is that the web server has access to this layer and is expected to place contribution files there. This could be a source of problems, as it changes global state under the feet of the coordinator logic. It could be better to have the contribution files passed in directly via the contribution API methods, and have the coordinator assign them to storage.
There is currently no performance benefit to having a third-party service place the files in storage, because the coordinator needs to download the files anyway to perform aggregation. The same goes for obtaining files to serve them: this could also be performed via the API, rather than through backdoor file-system access.
The one problem I can see with such an approach is that if we want to distribute aggregation (a major bottleneck for the universal setup), there is no longer a requirement to have the actual files stored with the coordinator. So as an alternative, we could at least have a locator to the files passed in via these methods, rather than assuming they have been put in the correct place. This would require less change to the current design. However, having files not stored on the coordinator might make it more complicated to perform a backup.
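Both variants of the proposal can be expressed in one API shape. This is a hypothetical sketch, not the current coordinator interface: the contribution endpoint receives either the file contents or a locator, and the coordinator itself is the only code that assigns anything to storage.

```rust
/// What a contributor submits through the API.
enum Contribution {
    /// File contents uploaded directly through the contribution endpoint.
    Inline(Vec<u8>),
    /// A locator (e.g. a URL or object key) to externally stored data,
    /// useful if aggregation is later distributed.
    Locator(String),
}

#[derive(Default)]
struct Storage {
    files: Vec<(String, Vec<u8>)>,
}

struct Coordinator {
    storage: Storage,
}

impl Coordinator {
    /// The single code path that places contribution data in storage;
    /// the web server no longer writes into storage behind the
    /// coordinator's back.
    fn submit_contribution(&mut self, chunk_id: u64, c: Contribution) -> String {
        let key = format!("chunk-{}", chunk_id);
        match c {
            Contribution::Inline(bytes) => self.storage.files.push((key.clone(), bytes)),
            Contribution::Locator(loc) => self.storage.files.push((key.clone(), loc.into_bytes())),
        }
        key
    }
}
```

Either way, the coordinator decides where contributions live, so its view of storage can never be changed out from under it.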
What to do?
For context, there are currently about 5000 lines of code in coordinator.rs and coordinator_state.rs, including unit tests. I have started a detailed analysis of the requirements for this code in #266, which should also help us estimate how long a major refactor might take. My personal take is that the problem at hand is not super complicated; in some ways the project has been over-engineered. I'm going to guess maybe 1 month with 1 developer to rewrite these modules.
Such changes always take longer than expected, so we should probably increase our worst-case estimates. We should also be careful not to suffer from the sunk-cost fallacy. The time spent so far on the project was not a waste, because it gave us valuable insight into the requirements and the problem domain, enabling a much better design than if we had decided to make major changes when @ibaryshnikov and I were first brought onto the project. We also now have a much more comprehensive set of tests, including integration tests.
I'm still going to spend the next couple of days attempting to complete #256; however, if it continues to surface problems, a rewrite would be my vote. Our lack of confidence in the codebase may yet come back to bite us if we hit problems in production or need to implement more features.