Skip to content

Conversation

@itsoyou
Copy link
Contributor

@itsoyou itsoyou commented Oct 9, 2025

Linked Media MSC3911 AP?: Deduplication- file path with sha256 #3465

Fixes: #3465

Currently we are storing the file based on media_id. With msc3911 introduced, there are many copy actions and that creates duplicated media. So now we try to save file with sha256 path and avoid duplication.

There is a feature flag introduce with use_sha256_paths. But even with feature flag is on, we still want to the existing media_id path logic function well, and backward compatible. Feature flag also prevents the existing tests failure.

Previously files were saved like this:

media/local_content/Ab/cD/...
media/local_thumbnails/Ab/cD/.../32-32-image-png-crop
media/remote_content/{server_name}/Ab/cD/...

But with use_sha256_paths enabled, media storage structure would be like this:

media/eb/f4/...
thumbnails/eb/f4/.../32-32-image-png-crop

The logic to save file with sha256 is quite simple. When we save a new file, it has to be sha256 paths, when we want to retrieve the file or delete the file, it should work for both sha256 path and media_id path.

When we save the media with sha256 path, the FileInfo() object must include the new field sha256.

@itsoyou itsoyou marked this pull request as ready for review October 26, 2025 22:40
@itsoyou itsoyou requested a review from a team as a code owner October 26, 2025 22:40
@itsoyou itsoyou changed the title WIP: deduplication: save file with sha path MSC3911: Deduplication - save file with sha256 path Oct 27, 2025
Copy link
Contributor

@jason-famedly jason-famedly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's look at this part and come back for another round

Comment on lines 335 to 337
sha256=media_info.sha256
if self.use_sha256_paths and media_info.sha256
else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If media_info.sha256 can be a value or None, do we need the condition at all(or any of the similar below)? Especially if using the sha256 for the path is guarded against in the function that produces the path, MediaStorage._file_info_to_path()(provided the other comment about merging _file_info_to_sha256_path into it is considered)?

jason-famedly added a commit that referenced this pull request Dec 8, 2025
A little bit of clean up. Abstracting how the local path to the file is
resolved. Ensure consistency when removing media, that if there are
still references to the media that only the metadata is removed until it
is the last reference.

Resolving concerns about moving existing media by leaving it in place
and only new media is de-duplicated.

Stage incoming remote media into it's legacy file path until the sha256
can be discovered, and only after it is written to the database metadata
is it moved into it's new place. This specific set of steps works with
the file path resolving, so if the moving process is interrupted it will
still be usable.

Always provide `StorageProvider` classes with a consistent relative
"path". For a media id of `1234567890`, it would break down to
`12/34/567890`. Do not use the sha256 for this path. `StorageProvider`s
are provided the `FileInfo` class as well, if they want to do media
de-dupe they can get it from there. Should we provide an enhancement to
the bundled `StorageProvider` to allow for this capability?

Depends on #155
itsoyou and others added 3 commits December 8, 2025 18:41
deletes

adjust some testing

moved this test case to a place that supports a thread pool, and broke
it into several testable components

remove unused stage_into_file() function

Apply feedback about resolve_path_to_media_file()

Lose this TODO and do not recreate the IOBytes object

Put back this new line, avoid merge conflicts

Abstract actual duplicate media deletion into it's own subroutine

Clean up counting how many references there are to a given media object
based on sha256

Update test to clarify some comments about behavior

Fix copypasta comment without update

lint
I guess because of a rust update?
@itsoyou itsoyou force-pushed the syk/file-path-with-sha branch from cff725d to 29bb918 Compare December 8, 2025 18:16
Comment on lines 5134 to 5160
def make_homeserver(self, reactor: MemoryReactor, clock: Clock) -> HomeServer:
config = self.default_config()
config.setdefault("experimental_features", {})
config["experimental_features"].update({"msc3911": {"enabled": True}})
config["enable_local_media_storage_deduplication"] = True
self.storage_path = self.mktemp()
self.media_store_path = self.mktemp()
os.mkdir(self.storage_path)
os.mkdir(self.media_store_path)
config["media_store_path"] = self.media_store_path
provider_config = {
"module": "synapse.media.storage_provider.FileStorageProviderBackend",
"store_local": True,
"store_synchronous": False,
"store_remote": True,
"config": {"directory": self.storage_path},
}
config["media_storage_providers"] = [provider_config]
return self.setup_test_homeserver(config=config)

def prepare(self, reactor: MemoryReactor, clock: Clock, hs: HomeServer) -> None:
self.test_dir = tempfile.mkdtemp(prefix="synapse-tests-")
self.addCleanup(shutil.rmtree, self.test_dir)
self.repo = hs.get_media_repository()
self.user = self.register_user("user", "pass")
self.tok = self.login("user", "pass")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only just caught this. Look in make_homeserver() where we are making directories, and then down in prepare() where we are making some more, but not using them? I tried scanning around for self.test_dir and didn't spot it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I looked at your changes and realized I may have miscommunicated what I meant. I think we can lose the directory creation in prepare(). At this stage in the in the bring up of the server and preparation of the test suite, the homeserver is already running and it's too late to inject a different directory, it's up in make_homeserver() as that is for setup that occurs before the server is started(that's where default_config() gets called too). Actually, do we need to create this directory at all? It should already be created for us

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants