Hi,
I have a lot of small files (100k files, each around 15 KB) and I want to save them to an MDS dataset.
I am trying to use multiple writer threads. As recommended in the documentation, each worker writes a subset of the files to its own subdirectory, and I merge the index.json files afterwards.
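For reference, the setup looks roughly like the sketch below. This is a simplified illustration rather than my exact code: the column schema, the `partition`, `write_subset`, and `write_all` helper names, and the `out_root` argument are placeholders, and I am assuming `MDSWriter` and `merge_index` behave as described in the Streaming docs on parallel dataset conversion.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def partition(items, num_workers):
    """Split items into num_workers round-robin subsets of near-equal size."""
    return [items[i::num_workers] for i in range(num_workers)]


def write_subset(job):
    """Write one subset of files into its own MDS subdirectory."""
    worker_id, paths, out_root = job
    # Imported lazily so partition() is usable without the library installed.
    from streaming import MDSWriter

    columns = {"name": "str", "data": "bytes"}  # assumed schema, adjust as needed
    out = f"{out_root}/{worker_id}"  # one subdirectory per worker
    with MDSWriter(out=out, columns=columns) as writer:
        for path in paths:
            p = Path(path)
            writer.write({"name": p.name, "data": p.read_bytes()})
    return out


def write_all(paths, out_root, num_workers=8):
    """Write all files with several workers, then merge the per-worker indexes."""
    from streaming.base.util import merge_index

    subsets = partition(list(paths), num_workers)
    jobs = [(i, subset, out_root) for i, subset in enumerate(subsets)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(write_subset, jobs))  # list() surfaces worker exceptions
    # Combine the per-subdirectory index.json files into one top-level index.
    merge_index(out_root, keep_local=True)
```

Here `out_root` can be a local directory or an `s3://` prefix. Swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` should be a one-line change, since the jobs are plain tuples, provided the script's entry point is guarded with `if __name__ == "__main__":`.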
The problem is that this is still quite slow (writing at around 300-500 KB/s), and adding more writers doesn't seem to speed it up much. Does MDSWriter release the GIL when writing to remote storage (an S3 bucket)? Or should I use separate processes in this case?
Are there any tricks I can use to improve or debug the writing speed? Even the single-threaded performance seems suboptimal. Would it make more sense to point MDSWriter at a local folder instead, and manually upload the entire folder once writing is done?