
Conversation

@masyukun

Missing feature -- figured I'd add it to the existing tool rather than writing yet another one-off.

@masyukun masyukun requested a review from a team as a code owner December 19, 2025 20:23
@masyukun masyukun requested review from FGasper and removed request for a team December 19, 2025 20:23
@FGasper
Collaborator

FGasper commented Jan 6, 2026

@masyukun Thank you for your submission.

What benefit does this confer versus running multiple import processes in parallel, e.g.:

ls *.json | parallel --jobs 4 'mongoimport --db myDB --collection myColl --file {}'


@FGasper FGasper left a comment


See earlier question.

Thank you!

@masyukun
Author

> @masyukun Thank you for your submission.
>
> What benefit does this confer versus running multiple import processes in parallel, e.g.:
>
> ls *.json | parallel --jobs 4 'mongoimport --db myDB --collection myColl --file {}'

Hi FGasper -- it's the same set of benefits you get from doing a single insertMany/bulkWrite instead of 1,000 individual inserts:

  • a single connection/session to the database instead of a volatile number of them (a spot-check sketch follows below)
  • orders of magnitude faster transfer of many documents (minutes vs hours)
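
For what it's worth, here is a minimal way to observe the connection-count difference yourself -- a sketch only, not part of this PR, assuming mongosh is installed and a real URI replaces the placeholder below:

# Sample the server's connection gauge while each method runs
# (placeholder URI -- substitute your own).
mongosh "mongodb+srv://user:pass@cluster.example.mongodb.net/" --quiet \
  --eval 'printjson(db.serverStatus().connections)'

Sampled during the parallel run, this should report roughly one client connection per concurrent mongoimport process; during a single batched import it should stay at one session.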

@FGasper
Collaborator

FGasper commented Jan 12, 2026

@masyukun Do you have some reproducible benchmarks to demonstrate the performance benefit of your branch over parallel restorations?

@masyukun
Author

masyukun commented Jan 12, 2026

@FGasper Sure -- unless you have a preferred test harness, I have a folder full of files and can run the two commands on a timer, then submit screenshots.

@FGasper
Collaborator

FGasper commented Jan 12, 2026

@masyukun What we’d like to see is reproducible results comparing parallel invocation versus your submission.

@masyukun
Author

masyukun commented Jan 12, 2026

Data folder

  • Contents: 18,247 JSON documents
  • Size: 131,917,828 bytes (168.3 MB on disk)
  • ZIP too large to attach (59.8 MB exceeds the 25 MB limit), but available on request

Method 1: parallel using *NIX pipes

Command

Original command: ls *.json | parallel --jobs 4 'mongoimport --db myDB --collection myColl --file {}'

Executed command: time find /Users/matthewroyal/Documents/GitHub/health-inspect/inspection_reports -name '*.json' | parallel --jobs 4 './mongoimport --uri="mongodb+srv://***:***@fooddata.8huxa.mongodb.net/" --db myDB --collection myCollParallel --file {}'

  • Error: zsh: argument list too long: ls (the *.json glob expands past the kernel's argument-length limit)
    • My folder has only 18,247 JSON files -- many real-world environments have millions to billions of files.
    • Since the suggested command evidently hadn't been run in zsh against a folder with this many JSON files, I substituted ls with the more scalable find: find /Users/matthewroyal/Documents/GitHub/health-inspect/inspection_reports -name '*.json' (a hardened variant of this pipeline is sketched after this list)
  • Prefixed the command with time to measure wall-clock execution time
  • Used the mongo-tools build included in the root project folder (./mongoimport) instead of the standard command on PATH
  • Customized the database and collection names and anonymized the URI string
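
As an aside, a more robust version of the find pipeline -- a sketch, not what was benchmarked here -- NUL-delimits the file list so paths containing spaces or newlines survive the pipe (find's -print0 and parallel's -0/--null are standard options; the directory path is a placeholder):

find /path/to/json_dir -name '*.json' -print0 \
  | parallel -0 --jobs 4 'mongoimport --db myDB --collection myColl --file {}'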

Prerequisites

  • Installed parallel on Mac using brew install parallel

Output

  • Ingestion time was 57:16.55 (3,436.55 s) -- nearly an hour, 67.5x longer than the --dir option
  • See attached output log "myCollParallel-output.log"

Evaluation

  • Running mongoimport per file from the shell via parallel is impractical for real-world workloads
  • Many customers restrict installation of third-party tools and may not have access to the parallel package
  • 25 transient connection errors occurred while uploading the folder of JSON files
    • This method masks which file was being imported when an error occurred
    • Handling or summarizing errors is difficult and labor-intensive without a report of which files failed (a partial mitigation is sketched after this list)
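
One partial mitigation for the error-attribution problem, assuming GNU parallel is available -- a sketch, untested in this benchmark: parallel's --joblog writes one line per job, including its exit value, which at least identifies which file's import failed.

find /path/to/json_dir -name '*.json' \
  | parallel --jobs 4 --joblog import.joblog \
      'mongoimport --db myDB --collection myColl --file {}'
# Column 7 of the joblog is the exit value; non-zero rows are failed files.
awk 'NR > 1 && $7 != 0' import.joblog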

Method 2: --dir flag

Command

time ./bin/mongoimport --dir=/Users/matthewroyal/Documents/GitHub/health-inspect/inspection_reports --uri="mongodb+srv://***:***@fooddata.8huxa.mongodb.net/" --db=myDB --collection=myCollDirflag

Prerequisites

  • Clone the PR branch, build, and run the command from the root of the checkout

Output

  • Ingestion time was 50.912 s -- under a minute (98.52% faster than the baseline mongoimport run in parallel with a third-party tool; arithmetic below)
  • See attached output log "myCollDirflag-output.log"
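
For the record, the arithmetic behind those two figures (57:16.55 = 3,436.55 s for Method 1 vs 50.912 s for Method 2), checked with bc:

echo 'scale=2; 3436.55 / 50.912' | bc               # 67.49 -> ~67.5x
echo 'scale=4; 100 * (1 - 50.912 / 3436.55)' | bc   # 98.5200 -> 98.52% faster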

Evaluation

  • Execution time was dramatically improved over even parallel runs of single-file insertions by using the insertMany/bulkWrite construct.
    • This is a common use case in real-world systems, as evidenced by our own forums; the current workaround is to abandon mongoimport and write a custom one-off data loader, which adds friction and maintenance burden for DevOps users.
  • Failures noted during ingestion: none

Performance metrics comparison

(Screenshot: performance metrics comparison, 2026-01-12)

myCollDirflag-output.log

myCollParallel-output.log

@masyukun masyukun requested a review from FGasper January 12, 2026 23:37