feat: Provide unified VFS for different protocol file systems#77

Open
shirly121 wants to merge 4 commits into alibaba:main from shirly121:add_vfs

Conversation

shirly121 (Collaborator) commented Mar 18, 2026

What do these changes do?

Related issue number

Fixes #55

Greptile Summary

This PR replaces the old Kùzu-derived VirtualFileSystem / LocalFileSystem / CompressedFileSystem stack (under neug::common) with a new, protocol-oriented FileSystemRegistry (under neug::fsys), and introduces a MetadataRegistry singleton to give any component access to the active VFS and catalog without threading a ClientContext pointer everywhere. The change also removes the EXPORT DATABASE / IMPORT DATABASE binder paths, the GVfsHolder gopt component, and a number of WAL/storage headers that were no longer in use.

Key changes:

  • New include/neug/utils/file_sys/file_system.h / src/utils/file_sys/file_system.cc: a clean factory-registry (FileSystemRegistry) with a shared_mutex-protected map of protocol → FileSystemFactory lambdas; a default LocalFileSystem factory is registered for the "file" protocol.
  • New MetadataRegistry static class providing getVFS() and getCatalog() accessors, replacing the now-removed GVfsHolder and GCatalogHolder gopt components.
  • All read/write functions (CSV, JSON, export) migrated from hard-coded LocalFileSystemProvider to MetadataRegistry::getVFS()->Provide(schema).
  • ExtensionAPI::registerFileSystem() added so external extensions can plug in protocol-specific file systems (e.g. S3, OSS).
  • GOptTest::TearDown now calls MetadataRegistry::unregisterMetadata() before resetting resources, addressing the dangling-pointer concern from the previous review.
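The registry design listed above can be illustrated with a minimal sketch. It is not the PR's code: `FileSystem` and `LocalFileSystem` here are simplified stand-ins, the `name()` method and the `Register` signature are assumptions, and `Provide` takes a protocol string directly rather than the schema object the PR uses.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <stdexcept>
#include <string>
#include <utility>

// Simplified stand-in for the PR's abstract FileSystem interface (assumed shape).
struct FileSystem {
  virtual ~FileSystem() = default;
  virtual std::string name() const = 0;
};

struct LocalFileSystem : FileSystem {
  std::string name() const override { return "local"; }
};

using FileSystemFactory = std::function<std::unique_ptr<FileSystem>()>;

class FileSystemRegistry {
 public:
  FileSystemRegistry() {
    // A default factory for the "file" protocol, as the summary describes.
    Register("file", [] { return std::make_unique<LocalFileSystem>(); });
  }

  // Writers take the mutex exclusively.
  void Register(const std::string& protocol, FileSystemFactory factory) {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    factories_[protocol] = std::move(factory);
  }

  // Readers share the mutex, so concurrent lookups do not block each other.
  std::unique_ptr<FileSystem> Provide(const std::string& protocol) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    auto it = factories_.find(protocol);
    if (it == factories_.end()) {
      throw std::invalid_argument("No file system registered for protocol: " +
                                  protocol);
    }
    return it->second();
  }

 private:
  mutable std::shared_mutex mutex_;
  std::map<std::string, FileSystemFactory> factories_;
};
```

Under these assumptions, an extension would call `Register("s3", ...)` with its own factory, and a lookup for an unregistered protocol fails loudly instead of falling back to the local file system. The sketch requires C++17 for `std::shared_mutex`.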

Issues found:

  • MetadataManager::~MetadataManager() unconditionally calls unregisterMetadata(), meaning any MetadataManager instance — even one that was never registered — will silently clear the registry on destruction, potentially invalidating a live registration.
  • const auto& is consistently bound to the std::unique_ptr<FileSystem> temporary returned by Provide() in CSV, JSON, and export functions; while technically valid (lifetime extension), the idiomatic pattern is auto fs = vfs->Provide(...).
  • FileSystemRegistry::Provide() infers the protocol only from paths[0] with no consistency check across remaining paths; mixed-protocol path lists are silently misrouted.
  • Stale #include <glob.h> remains in read_function.h after direct POSIX glob usage was removed.

Confidence Score: 3/5

  • Functionally sound for the single-MetadataManager production path, but the unconditional registry clear in the destructor is a latent correctness issue that can manifest in tests or edge cases.
  • The core VFS refactor is clean and the happy-path logic is correct. However, the unconditional unregisterMetadata() call in MetadataManager::~MetadataManager() means any second (unregistered) instance being destroyed will silently invalidate the live registry — a real correctness risk in test suites or any code that creates temporary MetadataManager objects. The missing cross-path protocol consistency check in Provide() is a secondary logic concern. These warrant fixes before merging.
  • src/compiler/main/metadata_manager.cpp (unconditional registry clear in destructor) and src/utils/file_sys/file_system.cc (no protocol consistency validation across paths).

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/neug/utils/file_sys/file_system.h | New unified VFS header introducing FileSystem abstract interface and FileSystemRegistry factory. Clean design with a shared_mutex-protected factory map; no critical issues. |
| src/utils/file_sys/file_system.cc | Implements LocalFileSystem and FileSystemRegistry::Provide. Key issue: protocol is inferred only from the first path with no validation across remaining paths, silently misrouting mixed-protocol lists. |
| src/compiler/main/metadata_manager.cpp | Destructor unconditionally calls MetadataRegistry::unregisterMetadata(), meaning any unregistered MetadataManager instance that is destroyed will silently clear the registry entry of a different live instance. |
| src/compiler/main/metadata_registry.cpp | New MetadataRegistry providing static global access to the VFS and catalog via a raw pointer. unregisterMetadata() now exists (addressing the previous thread concern), but the destructor-based clearing is unconditional. |
| include/neug/compiler/function/csv_read_function.h | Replaces hardcoded LocalFileSystemProvider with VFS-aware dispatch. const auto& binding to temporary unique_ptr from Provide() works due to lifetime extension but is stylistically unusual and repeated in both execFunc and sniffFunc. |
| extension/json/include/json_read_function.h | JSON read/sniff functions now use the unified VFS. Same const auto&-to-temporary pattern as CSV; otherwise the migration is straightforward and correct. |
| include/neug/compiler/extension/extension_api.h | Adds registerFileSystem() to ExtensionAPI, enabling third-party extensions to plug in protocol-specific file systems. Clean and straightforward addition. |
| tests/compiler/gopt_test.h | GOptTest::TearDown now explicitly calls unregisterMetadata() before resetting resources, addressing the dangling-pointer concern raised in the previous thread. Teardown ordering is correct. |
| include/neug/compiler/function/read_function.h | Simplified after VFS refactor; stale #include <glob.h> remains even though direct POSIX glob usage has been removed and globbing is now handled by match_files_with_pattern. |
| src/compiler/function/csv_export_function.cpp | Export path now uses the unified VFS for protocol dispatch. Same const auto&-to-temporary pattern as the read functions; otherwise functionally correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["MetadataManager\nowns FileSystemRegistry + Catalog"] -->|registerMetadata on construction| B["MetadataRegistry\nstatic raw pointer"]
    A -->|unregisterMetadata on destruction| B

    B -->|getVFS| C["FileSystemRegistry\nshared_mutex protected"]
    B -->|getCatalog| D["GCatalog"]

    C -->|Provide infers protocol from paths| E["Protocol Selection"]
    E -->|file protocol| F["LocalFileSystem\nauto-registered default"]
    E -->|s3 oss http| G["External FileSystem\nregistered via ExtensionAPI"]

    F -->|glob| H["match_files_with_pattern"]
    F -->|toArrowFileSystem| I["arrow LocalFileSystem"]
    G -->|toArrowFileSystem| J["arrow remote FS"]

    K["CSVReadFunction\nJsonReadFunction\nCSVExportFunction"] -->|getVFS then Provide| C
    K -->|glob + toArrowFileSystem| L["ArrowReader or ArrowWriter"]

    M["ExtensionAPI::registerFileSystem"] -->|Register protocol factory| C

Comments Outside Diff (3)

  1. src/utils/file_sys/file_system.cc, line 4180-4196 (link)

    P1 Protocol inferred from first path only; used for all paths

    Provide determines the protocol by inspecting only the first element of schema.paths, then returns a single FileSystem instance that is subsequently used to glob() all paths in the schema. If a query provides multiple paths belonging to different protocols (e.g., a mix of local file:// and remote s3://), every path beyond the first will be resolved using the wrong file system — silently returning empty results or throwing.

    This is also inconsistent with the call sites in csv_read_function.h and json_read_function.h, where the returned fs object is used to glob each path individually:

    for (const auto& path : state->schema.file.paths) {
      const auto& resolved = fs->glob(path);   // same fs for ALL paths
      ...
    }

    Consider either (a) documenting the assumption that all paths share a single protocol, or (b) calling Provide once per path and selecting the appropriate factory for each.

  2. src/utils/file_sys/file_system.cc, line 4168-4172 (link)

    P2 Default factory is registered as "file", but file:// URIs reach LocalFileSystem::glob unstripped

    The default factory is registered under the key "file". In Provide, a path with no :// falls back to protocol = "file", and a path prefixed with file:// (a valid URI scheme for local files) also yields protocol = "file" via the find("://") branch, so factory selection is correct in both cases. However, Provide only extracts the scheme and leaves the path itself unchanged: for file:///etc/passwd, LocalFileSystem::glob receives the full URI string including file://, which match_files_with_pattern may not handle correctly.

    Consider normalising file:// URIs by stripping the scheme before passing the path to LocalFileSystem::glob, or explicitly documenting that file:// URIs are not supported.

  3. include/neug/compiler/planner/gopt_planner.h, line 1277 (link)

    P1 registerMetadata called in constructor but never cleared on destruction

    GOptPlanner calls MetadataRegistry::registerMetadata(database.get()) in its constructor, which stores a raw pointer in a static variable. If a second GOptPlanner is constructed (e.g., in a server environment that reloads its planner), the old static pointer is silently overwritten. More critically, when the first GOptPlanner is destroyed, any code that fires during destruction (e.g., from ~MetadataManager) that calls back into MetadataRegistry::getVFS() or MetadataRegistry::getCatalog() will now retrieve the second planner's metadata instead of the first.

    The GOptPlanner destructor (or MetadataManager's destructor) should call unregisterMetadata() to null out the static pointer when the owning object goes away.
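The path-handling suggestions in comments 1 and 2 can be sketched together. The helper names below (`ProtocolOf`, `StripFileScheme`, `GroupByProtocol`) are hypothetical and do not appear in the PR; the sketch assumes the same inference rule as Provide (text before `://` is the protocol, no scheme means "file") and that remote file systems want their full URI while the local one wants a plain path.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper mirroring the inference in Provide(): the text before
// "://" is the protocol, and a path without a scheme is treated as "file".
inline std::string ProtocolOf(const std::string& path) {
  const auto pos = path.find("://");
  return pos == std::string::npos ? std::string("file") : path.substr(0, pos);
}

// Hypothetical helper: removes a leading "file://" so that a local glob sees
// a plain path. Any other input is returned unchanged.
inline std::string StripFileScheme(const std::string& path) {
  static const std::string kPrefix = "file://";
  if (path.compare(0, kPrefix.size(), kPrefix) == 0) {
    return path.substr(kPrefix.size());
  }
  return path;
}

// Groups paths by protocol so that each group can be handed to its own file
// system, instead of routing every path through the first path's protocol.
inline std::map<std::string, std::vector<std::string>> GroupByProtocol(
    const std::vector<std::string>& paths) {
  std::map<std::string, std::vector<std::string>> groups;
  for (const auto& path : paths) {
    groups[ProtocolOf(path)].push_back(StripFileScheme(path));
  }
  return groups;
}
```

A caller could then request one file system per map key and glob only that key's paths. Note that `StripFileScheme` does not handle a URI authority such as `file://localhost/...`; whether that form needs support is a decision for the PR author.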

Last reviewed commit: "minor fix"

Greptile also left 3 inline comments on this PR.

Committed-by: Xiaoli Zhou from Dev container
Committed-by: Xiaoli Zhou from Dev container
@shirly121 shirly121 requested a review from BingqingLyu March 19, 2026 04:09
shirly121 and others added 2 commits March 19, 2026 13:59
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Committed-by: Xiaoli Zhou from Dev container
@shirly121
Collaborator Author

@greptile

 }

-MetadataManager::~MetadataManager() = default;
+MetadataManager::~MetadataManager() { MetadataRegistry::unregisterMetadata(); }

P1 Destructor unconditionally clears registry regardless of which instance is registered

MetadataManager::~MetadataManager() always calls MetadataRegistry::unregisterMetadata(), which sets the static metadataManager pointer to nullptr. This means if ANY MetadataManager instance is destroyed — even one that was never registered — it silently invalidates the current registry entry.

Consider a scenario where a second (unregistered) MetadataManager instance is created and then destroyed (e.g., a local/temporary instance in a unit test or helper function). Its destructor would clear the registry, leaving the primary, still-live MetadataManager unreachable via MetadataRegistry::getVFS() or MetadataRegistry::getCatalog(), causing subsequent calls to throw.

A safer pattern guards on identity before clearing:

MetadataManager::~MetadataManager() {
    // Only unregister if *this* instance is the currently registered one
    if (MetadataRegistry::metadataManager == this) {
        MetadataRegistry::unregisterMetadata();
    }
}

This requires either making metadataManager accessible (e.g., via a friend declaration already in place) or adding a guarded unregisterMetadata(MetadataManager*) overload.
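The second option, a guarded overload, could look roughly like the sketch below. Both classes are reduced to the minimum needed to show the identity check; the member name `current_` and the `isRegistered()` accessor are assumptions added for illustration, and the sketch is not thread-safe.

```cpp
class MetadataManager;

// Reduced stand-in for the PR's MetadataRegistry (static raw pointer only).
class MetadataRegistry {
 public:
  static void registerMetadata(MetadataManager* manager) { current_ = manager; }

  // Hypothetical guarded overload: clears the registry only when `manager`
  // is the instance that is currently registered.
  static void unregisterMetadata(MetadataManager* manager) {
    if (current_ == manager) {
      current_ = nullptr;
    }
  }

  // Illustration-only accessor so the behaviour can be observed.
  static bool isRegistered() { return current_ != nullptr; }

 private:
  inline static MetadataManager* current_ = nullptr;  // C++17 inline variable
};

class MetadataManager {
 public:
  // Passing `this` makes destruction of an unregistered instance a no-op.
  ~MetadataManager() { MetadataRegistry::unregisterMetadata(this); }
};
```

With this shape, a temporary MetadataManager created and destroyed while another instance is registered leaves the registration intact, and the registry is cleared only when the registered instance itself is destroyed.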

Comment on lines +119 to +121
const auto& vfs = neug::main::MetadataRegistry::getVFS();
const auto& fs = vfs->Provide(state->schema.file);
auto resolvedPaths = std::vector<std::string>();

P2 const auto& bound to a temporary unique_ptr is misleading about ownership

vfs->Provide(...) returns a std::unique_ptr<FileSystem> by value. Binding a const auto& to this prvalue does extend the temporary's lifetime to the reference's scope (C++ [class.temporary]), but the pattern is fragile and unusual: it prevents moving the pointer, and readers may not immediately recognize that the object is owned by the reference here rather than held elsewhere.

The same pattern appears in:

  • include/neug/compiler/function/csv_read_function.h lines 119–120 (both execFunc and sniffFunc)
  • extension/json/include/json_read_function.h lines 47–48 and 79–80
  • src/compiler/function/csv_export_function.cpp lines 63–64

Prefer the idiomatic owning declaration:

Suggested change
-const auto& vfs = neug::main::MetadataRegistry::getVFS();
-const auto& fs = vfs->Provide(state->schema.file);
-auto resolvedPaths = std::vector<std::string>();
+const auto& vfs = neug::main::MetadataRegistry::getVFS();
+auto fs = vfs->Provide(state->schema.file);

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
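The ownership point can be shown in isolation. In the sketch below, `FileSystem`, `Provide`, `Reader`, and `MakeReader` are hypothetical stand-ins rather than the PR's types; the only assumption carried over is that the factory returns a `std::unique_ptr` by value.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a file system object.
struct FileSystem {
  std::string protocol;
};

// Hypothetical factory returning ownership by value, like Provide() in the PR.
inline std::unique_ptr<FileSystem> Provide(const std::string& protocol) {
  return std::make_unique<FileSystem>(FileSystem{protocol});
}

// A consumer that takes ownership of the file system.
class Reader {
 public:
  explicit Reader(std::unique_ptr<FileSystem> fs) : fs_(std::move(fs)) {}
  const std::string& protocol() const { return fs_->protocol; }

 private:
  std::unique_ptr<FileSystem> fs_;
};

inline Reader MakeReader(const std::string& protocol) {
  // Owning declaration: `fs` is a movable unique_ptr.
  auto fs = Provide(protocol);
  // With `const auto& fs = Provide(protocol);` the temporary would still be
  // alive here, but std::move(fs) would produce a const rvalue, which cannot
  // bind to unique_ptr's move constructor, so the next line would not compile.
  return Reader(std::move(fs));
}
```

Both spellings keep the object alive for the scope, so the existing call sites are not wrong; the owning form simply keeps the option of transferring the pointer later.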

Comment on lines +70 to +84
std::string protocol = schema.protocol;
if (protocol.empty()) {
  const auto& paths = schema.paths;
  if (paths.empty()) {
    THROW_INVALID_ARGUMENT_EXCEPTION("No file paths provided");
  }
  // we assume all paths share the same protocol
  const auto& path = paths[0];
  auto pos = path.find("://");
  if (pos != std::string::npos) {
    protocol = path.substr(0, pos);
  } else {
    protocol = "file";
  }
}

P2 Mixed-protocol path lists are silently misrouted

When schema.protocol is empty, Provide() infers the protocol from only paths[0] and uses that single protocol to dispatch all paths. The comment acknowledges this assumption (// we assume all paths share the same protocol), but there is no validation that the remaining paths actually honour it.

If a caller passes paths = ["s3://bucket/a.csv", "file:///local/b.csv"], the s3 factory is selected and applied to both paths. The local path would be handed to the S3 file system, likely causing a confusing runtime error far from the source of the mistake.

Consider validating protocol consistency across all paths, or at least document this as a precondition that callers are responsible for enforcing:

// After determining `protocol` from paths[0]:
for (size_t i = 1; i < paths.size(); ++i) {
    auto pos = paths[i].find("://");
    std::string p = (pos != std::string::npos) ? paths[i].substr(0, pos) : "file";
    if (p != protocol) {
        THROW_INVALID_ARGUMENT_EXCEPTION(
            "All paths must use the same protocol; got '" + protocol +
            "' and '" + p + "'");
    }
}

Development

Successfully merging this pull request may close these issues.

[Feature] Add VirtualFileSystemManager for dynamic file system registration (local, http, s3, oss)
