feat: Provide unified VFS for different protocol file systems#77
feat: Provide unified VFS for different protocol file systems#77shirly121 wants to merge 4 commits intoalibaba:mainfrom
Conversation
Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
|
@greptile |
| } | ||
|
|
||
| MetadataManager::~MetadataManager() = default; | ||
| MetadataManager::~MetadataManager() { MetadataRegistry::unregisterMetadata(); } |
There was a problem hiding this comment.
Destructor unconditionally clears registry regardless of which instance is registered
MetadataManager::~MetadataManager() always calls MetadataRegistry::unregisterMetadata(), which sets the static metadataManager pointer to nullptr. This means if ANY MetadataManager instance is destroyed — even one that was never registered — it silently invalidates the current registry entry.
Consider a scenario where a second (unregistered) MetadataManager instance is created and then destroyed (e.g., a local/temporary instance in a unit test or helper function). Its destructor would clear the registry, leaving the primary, still-live MetadataManager unreachable via MetadataRegistry::getVFS() or MetadataRegistry::getCatalog(), causing subsequent calls to throw.
A safer pattern guards on identity before clearing:
MetadataManager::~MetadataManager() {
// Only unregister if *this* instance is the currently registered one
if (MetadataRegistry::metadataManager == this) {
MetadataRegistry::unregisterMetadata();
}
}This requires either making metadataManager accessible (e.g., via a friend declaration already in place) or adding a guarded unregisterMetadata(MetadataManager*) overload.
| const auto& vfs = neug::main::MetadataRegistry::getVFS(); | ||
| const auto& fs = vfs->Provide(state->schema.file); | ||
| auto resolvedPaths = std::vector<std::string>(); |
There was a problem hiding this comment.
const auto& bound to a temporary unique_ptr is misleading about ownership
vfs->Provide(...) returns a std::unique_ptr<FileSystem> by value. Binding a const auto& to this prvalue does extend the temporary's lifetime to the reference's scope (C++ [class.temporary]), but the pattern is fragile and unusual: it prevents moving the pointer, and readers may not immediately recognize that the object is owned by the reference here rather than held elsewhere.
The same pattern appears in:
include/neug/compiler/function/csv_read_function.hlines 119–120 (bothexecFuncandsniffFunc)extension/json/include/json_read_function.hlines 47–48 and 79–80src/compiler/function/csv_export_function.cpplines 63–64
Prefer the idiomatic owning declaration:
| const auto& vfs = neug::main::MetadataRegistry::getVFS(); | |
| const auto& fs = vfs->Provide(state->schema.file); | |
| auto resolvedPaths = std::vector<std::string>(); | |
| const auto& vfs = neug::main::MetadataRegistry::getVFS(); | |
| auto fs = vfs->Provide(state->schema.file); |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| std::string protocol = schema.protocol; | ||
| if (protocol.empty()) { | ||
| const auto& paths = schema.paths; | ||
| if (paths.empty()) { | ||
| THROW_INVALID_ARGUMENT_EXCEPTION("No file paths provided"); | ||
| } | ||
| // we assume all paths share the same protocol | ||
| const auto& path = paths[0]; | ||
| auto pos = path.find("://"); | ||
| if (pos != std::string::npos) { | ||
| protocol = path.substr(0, pos); | ||
| } else { | ||
| protocol = "file"; | ||
| } | ||
| } |
There was a problem hiding this comment.
Mixed-protocol path lists are silently misrouted
When schema.protocol is empty, Provide() infers the protocol from only paths[0] and uses that single protocol to dispatch all paths. The comment acknowledges this assumption (// we assume all paths share the same protocol), but there is no validation that the remaining paths actually honour it.
If a caller passes paths = ["s3://bucket/a.csv", "file:///local/b.csv"], the s3 factory is selected and applied to both paths. The local path would be handed to the S3 file system, likely causing a confusing runtime error far from the source of the mistake.
Consider validating protocol consistency across all paths, or at least document this as a precondition that callers are responsible for enforcing:
// After determining `protocol` from paths[0]:
for (size_t i = 1; i < paths.size(); ++i) {
auto pos = paths[i].find("://");
std::string p = (pos != std::string::npos) ? paths[i].substr(0, pos) : "file";
if (p != protocol) {
THROW_INVALID_ARGUMENT_EXCEPTION(
"All paths must use the same protocol; got '" + protocol +
"' and '" + p + "'");
}
}
What do these changes do?
Related issue number
Fixes #55
Greptile Summary
This PR replaces the old Kùzu-derived
VirtualFileSystem/LocalFileSystem/CompressedFileSystemstack (underneug::common) with a new, protocol-orientedFileSystemRegistry(underneug::fsys), and introduces aMetadataRegistrysingleton to give any component access to the active VFS and catalog without threading aClientContextpointer everywhere. The change also removes theEXPORT DATABASE/IMPORT DATABASEbinder paths, theGVfsHoldergopt component, and a number of WAL/storage headers that were no longer in use.Key changes:
include/neug/utils/file_sys/file_system.h/src/utils/file_sys/file_system.cc: a clean factory-registry (FileSystemRegistry) with ashared_mutex-protected map of protocol →FileSystemFactorylambdas; a defaultLocalFileSystemfactory is registered for the"file"protocol.MetadataRegistrystatic class providinggetVFS()andgetCatalog()accessors, replacing the now-removedGVfsHolderandGCatalogHoldergopt components.LocalFileSystemProvidertoMetadataRegistry::getVFS()->Provide(schema).ExtensionAPI::registerFileSystem()added so external extensions can plug in protocol-specific file systems (e.g. S3, OSS).GOptTest::TearDownnow callsMetadataRegistry::unregisterMetadata()before resetting resources, addressing the dangling-pointer concern from the previous review.Issues found:
MetadataManager::~MetadataManager()unconditionally callsunregisterMetadata(), meaning anyMetadataManagerinstance — even one that was never registered — will silently clear the registry on destruction, potentially invalidating a live registration.const auto&is consistently bound to thestd::unique_ptr<FileSystem>temporary returned byProvide()in CSV, JSON, and export functions; while technically valid (lifetime extension), the idiomatic pattern isauto fs = vfs->Provide(...).FileSystemRegistry::Provide()infers the protocol only frompaths[0]with no consistency check across remaining paths; mixed-protocol path lists are silently misrouted.#include <glob.h>remains inread_function.hafter direct POSIX glob usage was removed.Confidence Score: 3/5
unregisterMetadata()call inMetadataManager::~MetadataManager()means any second (unregistered) instance being destroyed will silently invalidate the live registry — a real correctness risk in test suites or any code that creates temporaryMetadataManagerobjects. The missing cross-path protocol consistency check inProvide()is a secondary logic concern. These warrant fixes before merging.src/compiler/main/metadata_manager.cpp(unconditional registry clear in destructor) andsrc/utils/file_sys/file_system.cc(no protocol consistency validation across paths).Important Files Changed
FileSystemabstract interface andFileSystemRegistryfactory. Clean design with ashared_mutex-protected factory map; no critical issues.LocalFileSystemandFileSystemRegistry::Provide. Key issue: protocol is inferred only from the first path with no validation across remaining paths, silently misrouting mixed-protocol lists.MetadataRegistry::unregisterMetadata(), meaning any unregisteredMetadataManagerinstance that is destroyed will silently clear the registry entry of a different live instance.MetadataRegistryproviding static global access to the VFS and catalog via a raw pointer.unregisterMetadata()now exists (addressing the previous thread concern), but the destructor-based clearing is unconditional.LocalFileSystemProviderwith VFS-aware dispatch.const auto&binding to temporaryunique_ptrfromProvide()works due to lifetime extension but is stylistically unusual and repeated in bothexecFuncandsniffFunc.const auto&-to-temporary pattern as CSV; otherwise the migration is straightforward and correct.registerFileSystem()toExtensionAPI, enabling third-party extensions to plug in protocol-specific file systems. Clean and straightforward addition.GOptTest::TearDownnow explicitly callsunregisterMetadata()before resetting resources, addressing the dangling-pointer concern raised in the previous thread. Teardown ordering is correct.#include <glob.h>remains even though direct POSIX glob usage has been removed and globbing is now handled bymatch_files_with_pattern.const auto&-to-temporary pattern as the read functions; otherwise functionally correct.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["MetadataManager\nowns FileSystemRegistry + Catalog"] -->|registerMetadata on construction| B["MetadataRegistry\nstatic raw pointer"] A -->|unregisterMetadata on destruction| B B -->|getVFS| C["FileSystemRegistry\nshared_mutex protected"] B -->|getCatalog| D["GCatalog"] C -->|Provide infers protocol from paths| E["Protocol Selection"] E -->|file protocol| F["LocalFileSystem\nauto-registered default"] E -->|s3 oss http| G["External FileSystem\nregistered via ExtensionAPI"] F -->|glob| H["match_files_with_pattern"] F -->|toArrowFileSystem| I["arrow LocalFileSystem"] G -->|toArrowFileSystem| J["arrow remote FS"] K["CSVReadFunction\nJsonReadFunction\nCSVExportFunction"] -->|getVFS then Provide| C K -->|glob + toArrowFileSystem| L["ArrowReader or ArrowWriter"] M["ExtensionAPI::registerFileSystem"] -->|Register protocol factory| CComments Outside Diff (3)
src/utils/file_sys/file_system.cc, line 4180-4196 (link)Providedetermines the protocol by inspecting only the first element ofschema.paths, then returns a singleFileSysteminstance that is subsequently used toglob()all paths in the schema. If a query provides multiple paths belonging to different protocols (e.g., a mix of localfile://and remotes3://), every path beyond the first will be resolved using the wrong file system — silently returning empty results or throwing.This is also inconsistent with the call sites in
csv_read_function.handjson_read_function.h, where the returnedfsobject is used to glob each path individually:Consider either (a) documenting the assumption that all paths share a single protocol, or (b) calling
Provideonce per path and selecting the appropriate factory for each.src/utils/file_sys/file_system.cc, line 4168-4172 (link)"file", but local paths never containfile://The default factory is registered under the key
"file". InProvide, the fallback case (no://found in the path) assignsprotocol = "file", which correctly resolves. However, if a caller passes a path prefixed withfile://(which is a valid URI scheme for local files),protocolwould be set to"file"via thefind("://")branch — also correct. But a path likefile:///etc/passwdwould setprotocol = "file", strip nothing, and thenLocalFileSystem::globwould receive the full URI string includingfile://, whichmatch_files_with_patternmay not handle correctly.Consider normalising
file://URIs by stripping the scheme before passing the path toLocalFileSystem::glob, or explicitly documenting thatfile://URIs are not supported.include/neug/compiler/planner/gopt_planner.h, line 1277 (link)registerMetadatacalled in constructor but never cleared on destructionGOptPlannercallsMetadataRegistry::registerMetadata(database.get())in its constructor, which stores a raw pointer in a static variable. If a secondGOptPlanneris constructed (e.g., in a server environment that reloads its planner), the old static pointer is silently overwritten. More critically, when the firstGOptPlanneris destroyed, any code that fires during destruction (e.g., from~MetadataManager) that calls back intoMetadataRegistry::getVFS()orMetadataRegistry::getCatalog()will now retrieve the second planner's metadata instead of the first.The
GOptPlannerdestructor (orMetadataManager's destructor) should call anunregisterMetadata()to null out the static pointer when the owning object goes away.Last reviewed commit: "minor fix"