Copilot AI commented Nov 19, 2025

Custom Extractor Support Implementation ✅

This PR successfully implements support for custom extractors as requested in the issue. The solution allows library users to extend RecursiveExtractor with support for additional archive types (like .MSI and .MSP packages).

✅ All Tasks Complete:

  • Create ICustomAsyncExtractor interface extending AsyncExtractorInterface with a bool CanExtract(Stream) method
  • Add custom extractors support to Extractor class using constructor-based dependency injection
  • Use ICollection<ICustomAsyncExtractor> (implemented as HashSet) for internal storage
  • Add new constructor Extractor(IEnumerable<ICustomAsyncExtractor>) for DI support
  • Modify extraction logic to check custom extractors when file type is UNKNOWN
  • Add comprehensive tests for custom extractor functionality (10 tests, all passing)
  • Add documentation to README with examples
  • Run CodeQL security checks (no vulnerabilities found)
  • Verify implementation with standalone test program
  • Make CustomExtractors property internal (per code review feedback)
  • Rename interface to follow .NET naming conventions (ICustomAsyncExtractor)

Implementation Summary:

1. New Interface: ICustomAsyncExtractor

  • Extends AsyncExtractorInterface
  • Adds bool CanExtract(Stream) method for file type detection based on binary signatures
  • Similar to how MiniMagic works for built-in types
  • Follows .NET naming conventions for interfaces
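
As an illustration of signature-based detection, here is a minimal sketch of what a CanExtract check for .MSI/.MSP files might look like. The interface ICanExtractSketch and the class CompoundFileDetector are hypothetical stand-ins, not types from this PR; the real ICustomAsyncExtractor also inherits the extraction methods of AsyncExtractorInterface. Rewinding to the start of the stream and restoring the caller's position afterwards are assumptions about sensible behavior.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the CanExtract part of ICustomAsyncExtractor.
public interface ICanExtractSketch
{
    bool CanExtract(Stream stream);
}

public sealed class CompoundFileDetector : ICanExtractSketch
{
    // Header signature of the OLE Compound File Binary container,
    // the container format used by .MSI and .MSP packages.
    private static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    public bool CanExtract(Stream stream)
    {
        if (stream is null || !stream.CanRead || !stream.CanSeek)
        {
            return false;
        }

        long original = stream.Position;
        try
        {
            // Assumption: the signature is checked from the start of the stream.
            stream.Position = 0;
            var header = new byte[Signature.Length];
            int read = 0;
            while (read < header.Length)
            {
                int n = stream.Read(header, read, header.Length - read);
                if (n == 0)
                {
                    break; // stream is shorter than the signature
                }
                read += n;
            }
            return read == Signature.Length && header.AsSpan().SequenceEqual(Signature);
        }
        finally
        {
            // Restore the caller's position whether or not the check matched.
            stream.Position = original;
        }
    }
}
```

The same signature is shared by other compound-file formats (legacy Office documents, for example), so a real MSI extractor would likely need a further check beyond these eight bytes.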

2. Extractor API Enhancements:

  • CustomExtractors property (ICollection, internal) to store custom extractors
  • New constructor Extractor(IEnumerable<ICustomAsyncExtractor>) for dependency injection support
  • HashSet implementation for efficient lookup and deduplication
  • Null-safe handling of custom extractors in constructor
  • FindMatchingCustomExtractor() - Internal helper to find matching extractors
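
The constructor pattern described above can be sketched roughly as follows. ExtractorSketch, ICustomExtractorSketch and NeverMatchesSketch are hypothetical names used only for illustration, and skipping null elements inside the sequence is an assumption; the summary only states that a null argument is handled.

```csharp
#nullable enable
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for ICustomAsyncExtractor.
public interface ICustomExtractorSketch
{
    bool CanExtract(Stream stream);
}

// Trivial implementation, included only so the sketch can be exercised.
public sealed class NeverMatchesSketch : ICustomExtractorSketch
{
    public bool CanExtract(Stream stream) => false;
}

// Hypothetical stand-in for the Extractor class.
public class ExtractorSketch
{
    // Exposed as ICollection so the HashSet stays an implementation detail.
    internal ICollection<ICustomExtractorSketch> CustomExtractors { get; }

    public ExtractorSketch(IEnumerable<ICustomExtractorSketch>? customExtractors)
    {
        // HashSet deduplicates if the same instance is supplied twice.
        var set = new HashSet<ICustomExtractorSketch>();
        if (customExtractors != null)
        {
            foreach (var extractor in customExtractors)
            {
                // Assumption: null entries are skipped rather than rejected.
                if (extractor != null)
                {
                    set.Add(extractor);
                }
            }
        }
        CustomExtractors = set;
    }
}
```

Because the set is filled once in the constructor, a DI container can supply all registered implementations as an IEnumerable and the collection stays fixed for the lifetime of the instance.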

3. Extraction Logic Updates:

  • Modified both synchronous Extract and asynchronous ExtractAsync methods
  • When file type is UNKNOWN or no built-in extractor exists:
    1. Checks registered custom extractors using their CanExtract method
    2. Uses the first matching custom extractor to extract the file
    3. Falls back to returning the raw file if no custom extractor matches
  • Follows existing logging and error handling conventions
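
The fallback steps above can be sketched as below. This is not the code from the PR: ICustomExtractorSketch, LambdaExtractor and FallbackSketch are hypothetical, extraction results are reduced to file names, and swallowing exceptions thrown by a CanExtract call is an assumption based on the try-catch mentioned under Security & Quality.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical, simplified stand-in for ICustomAsyncExtractor.
public interface ICustomExtractorSketch
{
    bool CanExtract(Stream stream);
    IEnumerable<string> Extract(Stream stream);
}

// Delegate-backed implementation, included so the sketch is easy to exercise.
public sealed class LambdaExtractor : ICustomExtractorSketch
{
    private readonly Func<Stream, bool> _canExtract;
    private readonly Func<Stream, IEnumerable<string>> _extract;

    public LambdaExtractor(Func<Stream, bool> canExtract, Func<Stream, IEnumerable<string>> extract)
    {
        _canExtract = canExtract;
        _extract = extract;
    }

    public bool CanExtract(Stream stream) => _canExtract(stream);
    public IEnumerable<string> Extract(Stream stream) => _extract(stream);
}

public static class FallbackSketch
{
    // Returns the first extractor whose CanExtract succeeds, or null.
    public static ICustomExtractorSketch? FindMatching(
        IEnumerable<ICustomExtractorSketch> extractors, Stream stream)
    {
        foreach (var extractor in extractors)
        {
            try
            {
                if (extractor.CanExtract(stream))
                {
                    return extractor;
                }
            }
            catch (Exception)
            {
                // Assumption: a faulty detector is skipped instead of aborting extraction.
            }
        }
        return null;
    }

    // Called only when the type is unknown or has no built-in extractor.
    public static IEnumerable<string> ExtractUnknown(
        IEnumerable<ICustomExtractorSketch> extractors, Stream stream, string rawName)
    {
        var match = FindMatching(extractors, stream);
        // With no match, the raw file itself is returned as the only entry.
        return match != null ? match.Extract(stream) : new[] { rawName };
    }
}
```

One design point worth noting: when the extractors are stored in a HashSet, enumeration order is unspecified, so "first matching" is only well defined if at most one custom extractor accepts a given stream.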

4. Comprehensive Testing:

  • Created CustomExtractorTests.cs with 10 tests covering:
    • Constructor-based registration with single and multiple extractors
    • Null handling in constructor
    • Sync and async extraction using custom extractors
    • Fallback behavior when no custom extractor matches
    • Interaction with built-in extractors (built-in takes precedence)
  • All 10 tests pass
  • Tests can access internal members via InternalsVisibleTo attribute

5. Documentation:

  • Added detailed section to README.md with example code
  • Shows how to implement ICustomAsyncExtractor
  • Demonstrates constructor-based registration pattern for DI
  • Explains key concepts and best practices

6. Security & Quality:

  • CodeQL security analysis: ✅ No vulnerabilities found
  • Follows repository coding conventions
  • Proper error handling with try-catch in custom extractor checks
  • Stream position preservation in CanExtract implementations

7. Verification:

  • Created and ran standalone test program demonstrating end-to-end functionality
  • Custom extractor successfully recognizes and extracts custom file formats
  • Integration with existing extraction pipeline works correctly

Changes from Code Review:

  • Changed CustomExtractors property from public to internal per @gfs feedback
  • Added InternalsVisibleTo attribute to allow test assembly to access internal members
  • Renamed interface from CustomAsyncExtractorInterface to ICustomAsyncExtractor to follow .NET naming conventions
  • Changed CustomExtractors type from HashSet<ICustomAsyncExtractor> to ICollection<ICustomAsyncExtractor> per @gfs feedback for better abstraction
  • Replaced Add/Remove/Clear methods with constructor-based injection per @gfs and @danfiedler-msft feedback to support modern DI patterns

Pre-existing Issues:

  • 74 test failures existed before this PR (unrelated to this change)
  • These remain unchanged and are not caused by this implementation

The implementation is complete, tested, secure, and ready for use!

Original prompt

This section details the original issue you should resolve

<issue_title>Support custom extractors</issue_title>
<issue_description>The current interface doesn't allow registering custom extractors, which limits the package's functionality.
For example, I need to unpack .MSI and .MSP packages.

The hacky approach with UNKNOWN isn't suitable, as it's explicitly filtered.

I'd like to either replace the enum with a string representation or add an alternative extractor resolution mechanism.</issue_description>

<agent_instructions>Please implement the proposed solution that includes extending the AsyncExtractorInterface with a new bool CanExtract method modeled after the behavior of MiniMagic. The Extractor class should then have a set of Custom Extractors that can be customized by a library user to add additional CustomAsyncExtractorInterface implementors. If a filetype isn't detected as being one of the existing supported types, and comes out of the MiniMagic checks as unknown, we would then check the CanExtract method of each custom injected implementation to see if any of them support it. When implementing, follow the normal coding and logging conventions as seen in the repository and be sure to add some tests for this new functionality.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@gfs @LunarWhisper Thanks for the suggestion. I appreciate the use case for wanting to programmatically extend the library. I wonder though, this seems like it gets a little sticky if you were to implement multiple new extractor types: you can't set multiple extractors for the same type and you wouldn't be able to route them either, and it would still require some custom logic to allow for checking the unknown type rather than the extant logic that treats those as literal files.

The current archive detection logic operates on the bytes of the archive using the MiniMagic utility, rather than on the file extension. My thoughts on how this might work best, while supporting an arbitrary number of extensions, is to add a new property to Extractor.cs for an arbitrary number of custom extractors. For these custom extractors, maybe there could be a CustomIAsyncExtractor interface that also implements a CanExtract method that checks for binary signatures, like MiniMagic does. Then, when the generic extractor encounters something that isn't supported by default, it can: 1. check if there are any custom extractors defined; 2. if there are, iterate over them, executing CanExtract on the Stream; 3. when one of them indicates it can extract, use that one.

Thoughts on something like that? would that serve your use case?</comment_new>


Copilot AI and others added 2 commits November 19, 2025 22:43
Co-authored-by: gfs <98900+gfs@users.noreply.github.com>
Co-authored-by: gfs <98900+gfs@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add support for custom extractors in AsyncExtractorInterface" to "Add custom extractor interface for user-defined archive formats" Nov 19, 2025
Copilot AI requested a review from gfs November 19, 2025 22:51
Copilot finished work on behalf of gfs November 19, 2025 22:51
@gfs
Contributor

gfs commented Nov 19, 2025

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gfs gfs marked this pull request as ready for review November 20, 2025 16:49
@gfs
Contributor

gfs commented Nov 20, 2025

@copilot Change the CustomExtractors property on the Extractor class to be internal instead of public.

Co-authored-by: gfs <98900+gfs@users.noreply.github.com>
Author

Copilot AI commented Nov 20, 2025

@copilot Change the CustomExtractors property on the Extractor class to be internal instead of public.

Changed CustomExtractors property to internal and added InternalsVisibleTo attribute for test access in commit be00882.

Copilot finished work on behalf of gfs November 20, 2025 17:01
@gfs
Contributor

gfs commented Nov 20, 2025

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

scovetta
scovetta previously approved these changes Nov 21, 2025
Contributor

@danfiedler-msft danfiedler-msft left a comment

Overall LGTM.

There are a few choices in the PR that don't match up with modern .NET conventions but are fine as is.

Please feel free to ignore the nits as you like.

Contributor

nit: I know the bot is following the existing naming pattern, but the interface should be named ICustomAsyncExtractor

Contributor

Good point. @copilot please rename the newly added interface.

Author

Renamed interface to ICustomAsyncExtractor following .NET naming conventions in commit 387d579.

/// Collection of custom extractors that can handle file types not natively supported.
/// These are checked when a file type is detected as UNKNOWN.
/// </summary>
internal HashSet<CustomAsyncExtractorInterface> CustomExtractors { get; } = new HashSet<CustomAsyncExtractorInterface>();
Contributor

nit: the choice of HashSet is odd here. I'd expect this to be IEnumerable<ICustomAsyncExtractor>

Contributor

I thought HashSet was fine as an implementation level detail but did intentionally change this field to internal so as not to expose that. I think you're right though that there's no real reason to define that on the type side. I think I would argue that ICollection is better than IEnumerable, however, as conceptually this, I think, should always be a consistent set at runtime and not reenumerated on each touch of the set. I could probably be convinced otherwise if there's something I've not thought of. @danfiedler-msft

@copilot Please update this field definition type to be ICollection instead of HashSet. The actual implementation could remain HashSet.

Author

Changed CustomExtractors type to ICollection<ICustomAsyncExtractor> (implemented as HashSet) in commit 199d46a.

/// </summary>
/// <param name="customExtractor">The custom extractor implementation to add.</param>
/// <returns>True if the extractor was added, false if it was already present.</returns>
public bool AddCustomExtractor(CustomAsyncExtractorInterface customExtractor)
Contributor

nit: The Add/Remove paradigm seems odd to me. My preference would be to add a constructor that takes an IEnumerable<ICustomAsyncExtractor> parameter. This would support use of DI with Extractor as well.

Is there ever a case where an application would want to remove a custom extractor?

Contributor

I can't think of a convincing reason you'd want to remove extractors, as you could just instantiate a new Extractor object. I think the add/remove paradigm kind of fits with the previous setting mechanism, but realistically the constructor mechanism fits all the same use cases.

@copilot let's remove the Add/Remove mechanisms and instead have a new constructor for Extractor that takes IEnumerable.

Author

Removed Add/Remove/Clear methods and added constructor Extractor(IEnumerable<ICustomAsyncExtractor>) for DI support in commit 199d46a.

…e review

Co-authored-by: gfs <98900+gfs@users.noreply.github.com>
Copilot AI dismissed stale reviews from danfiedler-msft and scovetta via 387d579 November 21, 2025 19:06
Copilot finished work on behalf of gfs November 21, 2025 19:08
…ction

Co-authored-by: gfs <98900+gfs@users.noreply.github.com>
Copilot finished work on behalf of gfs November 21, 2025 19:17

Development

Successfully merging this pull request may close these issues.

Support custom extractors

4 participants