-
Notifications
You must be signed in to change notification settings - Fork 4
MSC3911: Deduplication - save file with sha256 path #155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: msc3911
Are you sure you want to change the base?
Conversation
jason-famedly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's look at this part and come back for another round
synapse/media/thumbnailer.py
Outdated
| sha256=media_info.sha256 | ||
| if self.use_sha256_paths and media_info.sha256 | ||
| else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If media_info.sha256 can be a value or None, do we need the condition at all(or any of the similar below)? Especially if using the sha256 for the path is guarded against in the function that produces the path, MediaStorage._file_info_to_path()(provided the other comment about merging _file_info_to_sha256_path into it is considered)?
A little bit of clean up. Abstracting how the local path to the file is resolved. Ensure consistency when removing media, that if there are still references to the media that only the metadata is removed until it is the last reference. Resolving concerns about moving existing media by leaving it in place and only new media is de-duplicated. Stage incoming remote media into it's legacy file path until the sha256 can be discovered, and only after it is written to the database metadata is it moved into it's new place. This specific set of steps works with the file path resolving, so if the moving process is interrupted it will still be usable. Always provide `StorageProvider` classes with a consistent relative "path". For a media id of `1234567890`, it would break down to `12/34/567890`. Do not use the sha256 for this path. `StorageProvider`s are provided the `FileInfo` class as well, if they want to do media de-dupe they can get it from there. Should we provide an enhancement to the bundled `StorageProvider` to allow for this capability? Depends on #155
deletes adjust some testing moved this test case to a place that supports a thread pool, and broke it into several testable components remove unused stage_into_file() function Apply feedback about resolve_path_to_media_file() Lose this TODO and do not recreate the IOBytes object Put back this new line, avoid merge conflicts Abstract actual duplicate media deletion into it's own subroutine Clean up counting how many references there are to a given media object based on sha256 Update test to clarify some comments about behavior Fix copypasta comment without update lint
I guess because of a rust update?
cff725d to
29bb918
Compare
tests/rest/client/test_media.py
Outdated
| def make_homeserver(self, reactor: MemoryReactor, clock: Clock) -> HomeServer: | ||
| config = self.default_config() | ||
| config.setdefault("experimental_features", {}) | ||
| config["experimental_features"].update({"msc3911": {"enabled": True}}) | ||
| config["enable_local_media_storage_deduplication"] = True | ||
| self.storage_path = self.mktemp() | ||
| self.media_store_path = self.mktemp() | ||
| os.mkdir(self.storage_path) | ||
| os.mkdir(self.media_store_path) | ||
| config["media_store_path"] = self.media_store_path | ||
| provider_config = { | ||
| "module": "synapse.media.storage_provider.FileStorageProviderBackend", | ||
| "store_local": True, | ||
| "store_synchronous": False, | ||
| "store_remote": True, | ||
| "config": {"directory": self.storage_path}, | ||
| } | ||
| config["media_storage_providers"] = [provider_config] | ||
| return self.setup_test_homeserver(config=config) | ||
|
|
||
| def prepare(self, reactor: MemoryReactor, clock: Clock, hs: HomeServer) -> None: | ||
| self.test_dir = tempfile.mkdtemp(prefix="synapse-tests-") | ||
| self.addCleanup(shutil.rmtree, self.test_dir) | ||
| self.repo = hs.get_media_repository() | ||
| self.user = self.register_user("user", "pass") | ||
| self.tok = self.login("user", "pass") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only just caught this. Look in make_homeserver() where we are making directories, and then down in prepare() where we are making some more, but not using them? I tried scanning around for self.test_dir and didn't spot it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I looked at your changes and realized I may have miscommunicated what I meant. I think we can lose the directory creation in prepare(). At this stage in the in the bring up of the server and preparation of the test suite, the homeserver is already running and it's too late to inject a different directory, it's up in make_homeserver() as that is for setup that occurs before the server is started(that's where default_config() gets called too). Actually, do we need to create this directory at all? It should already be created for us
…me != avatar url)
… the test suite, and force some tests to always use isolated media paths
Linked Media MSC3911 AP?: Deduplication- file path with sha256 #3465
Fixes: #3465
Currently we are storing the file based on
media_id. With msc3911 introduced, there are many copy actions and that creates duplicated media. So now we try to save file with sha256 path and avoid duplication.There is a feature flag introduce with
use_sha256_paths. But even with feature flag is on, we still want to the existing media_id path logic function well, and backward compatible. Feature flag also prevents the existing tests failure.Previously files were saved like this:
But with
use_sha256_pathsenabled, media storage structure would be like this:The logic to save file with sha256 is quite simple. When we save a new file, it has to be sha256 paths, when we want to retrieve the file or delete the file, it should work for both sha256 path and media_id path.
When we save the media with sha256 path, the
FileInfo()object must include the new fieldsha256.