Does MDSWriter release the GIL? #978

@antspy

Hi,

I have a lot of small files (100k files, each around 15 KB) and I want to save them to an MDS dataset.
I am trying to use multiple writer threads. As recommended in the documentation, each worker writes a subset of the files to a different subdirectory, and I join the index.json files afterwards.
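
A minimal sketch of the per-worker layout described above, using processes rather than threads so that the GIL is not a factor either way. This is an illustration under stated assumptions, not the code from this report: the `partition`, `write_partition`, and `write_all` helpers, the `part-00000` directory naming, and the two-column schema are all hypothetical. Only `MDSWriter(out=..., columns=...)` and `writer.write(...)` are taken from the Streaming library.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def partition(items, num_parts):
    """Split items into num_parts non-overlapping slices (round-robin)."""
    return [items[i::num_parts] for i in range(num_parts)]


def write_partition(task):
    """Write one slice of files into its own MDS subdirectory."""
    out_dir, paths = task
    # Imported inside the worker so each process loads the library itself.
    from streaming import MDSWriter

    # Assumed schema: file name plus raw file contents.
    columns = {"name": "str", "data": "bytes"}
    with MDSWriter(out=out_dir, columns=columns) as writer:
        for path in paths:
            p = Path(path)
            writer.write({"name": p.name, "data": p.read_bytes()})
    return out_dir


def write_all(paths, out_root, num_workers=8):
    """Fan the file list out to worker processes, one subdirectory each."""
    tasks = [
        (f"{out_root}/part-{i:05d}", part)
        for i, part in enumerate(partition(list(paths), num_workers))
        if part  # skip empty slices when there are fewer files than workers
    ]
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(write_partition, tasks))
```

Calling `write_all(file_list, out_root)` from under an `if __name__ == "__main__":` guard returns the list of subdirectories that were written. The per-directory index.json files still have to be joined afterwards, as in the threaded setup. Comparing the throughput of this version against the threaded one would show whether the GIL is the bottleneck.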

The problem is that this is still quite slow (around 300-500 KB/s), and adding more writers doesn't seem to speed it up much. Does MDSWriter release the GIL when writing to remote storage (an S3 bucket)? Or should I use separate processes in this case?

Are there any tricks I can use to improve or debug the write speed? Even the single-threaded performance seems suboptimal. Would it make more sense to point MDSWriter at a local folder instead, and manually upload the entire folder when done?
