
Verb - Merge

Merges one or more text files into a single file. Lines are sorted and duplicates are removed.

This action is designed to be as fast as possible; see Performance below for more information.

Algorithm

The basic algorithm is an external merge sort: input is split into initial chunks (shards) which are buffered to disk and sorted individually. Buffering to disk assumes there is more data than available memory, so this won't be as fast as a direct in-memory sort for small files (but small files are sorted quickly anyway).

There are three phases:

Phase 1 - Sharding / Splitting

All input files are read (in parallel if there are multiple or large files) and each line is written to a file based on the first byte of the line. That is, the input lines are split up into files based on how the lines start.

Any shard files larger than --max-sort-size are split again on the 2nd byte, and so on, until every shard is smaller than --max-sort-size.
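
As a rough illustration of the split phase (a simplified sketch, not the actual MassiveSort code; it works on strings rather than raw bytes and ignores the re-splitting of oversized shards):

using System.Collections.Generic;
using System.IO;

static class SplitSketch
{
    // Append each line to a shard file named after its first byte.
    // Oversized shards would then be re-split on their second byte, and so on.
    public static void SplitByFirstByte(string inputPath, string tempFolder)
    {
        var shardWriters = new Dictionary<byte, StreamWriter>();
        foreach (var line in File.ReadLines(inputPath))
        {
            if (line.Length == 0) continue;
            byte first = (byte)line[0];    // shard key: first byte (assumes ASCII for this sketch)
            if (!shardWriters.TryGetValue(first, out var writer))
            {
                writer = new StreamWriter(Path.Combine(tempFolder, $"shard_{first:x2}.txt"));
                shardWriters[first] = writer;
            }
            writer.WriteLine(line);
        }
        foreach (var w in shardWriters.Values) w.Dispose();
    }
}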

Phase 2 - Sorting and De-Duplication

Sorting is the major computational part of this process. Each shard file is read (shards are processed in key order) and its lines are sorted in memory. Shards are sorted in parallel, based on the number of sort workers, and duplicates are then removed.

Phase 3 - Write

The final sorted results are written to the output file in order.
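
Putting phases 2 and 3 together, a minimal sketch (again simplified: real shards are processed in parallel and streamed rather than loaded whole, and MassiveSort works on bytes, not strings):

using System;
using System.IO;

static class SortAndWriteSketch
{
    // Because shards were split on their leading bytes, sorting each shard
    // and concatenating them in shard-key order yields a fully sorted,
    // de-duplicated output.
    public static void SortShardsAndWrite(string[] shardPathsInKeyOrder, string outputPath)
    {
        using var output = new StreamWriter(outputPath);
        foreach (var shardPath in shardPathsInKeyOrder)
        {
            var lines = File.ReadAllLines(shardPath);
            Array.Sort(lines, StringComparer.Ordinal);     // sort one shard in memory
            string previous = null;
            foreach (var line in lines)
            {
                if (line == previous) continue;            // drop duplicates
                output.WriteLine(line);
                previous = line;
            }
        }
    }
}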

Memory Usage

Sorting, by nature, will use all the memory you can throw at it. Sorting gigabytes of files will use gigabytes of memory. The more memory you have, the better.

If you are curious how much memory your merge will need, add the --debug option and a memory usage estimate will be shown. You can also use Task Manager, Process Explorer, VMMap or PerfView to see the gruesome details.

The out of the box defaults should be OK if you have 2GB of available physical RAM (a 4GB machine should be fine) and up to 4 CPU cores.

The two options which most influence memory usage are --max-sort-size and --sort-workers.

The --max-sort-size buffer size is allocated 4 times over when sorting for various purposes. Reducing (or increasing) it has a substantial impact on memory usage. The default is 64MB which provides good performance without eating too much memory. If you have lots of RAM, consider increasing this. If you are running short of physical RAM, consider reducing it.

Every sort worker thread allocates its own set of buffers, so memory usage is multiplied by the number of workers. If you're running short of memory, reduce your workers.
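
As a rough worked example (assuming the 4× buffer allocation described above applies per worker): with the default --max-sort-size of 64MB and 4 sort workers, the sort buffers alone come to roughly 64MB × 4 buffers × 4 workers = 1GB, before any .NET overhead. Use --debug to get a real estimate for your own settings.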

.NET is garbage collected and, by nature, it will allocate more memory than it needs and not release it until well after it's done with it. This shows up as continuous memory growth during the sorting phase (often several gigabytes) followed by a sudden cliff as the garbage collector reclaims lots of memory. Garbage collection happens automatically, when the runtime thinks it is needed (and the runtime is actually quite smart). You can force garbage collection after each split or sort using the --aggressive-memory-collection option. This keeps peak memory usage down, but it does not reduce overall memory requirements and can actually lengthen the sorting process (because work must stop while garbage collection happens).
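
For reference, forcing a full collection between chunks in .NET looks something like the following (this illustrates the kind of work --aggressive-memory-collection triggers; the exact calls MassiveSort makes are not shown here):

GC.Collect();                   // request a full, blocking collection
GC.WaitForPendingFinalizers();  // let finalizers run
GC.Collect();                   // reclaim objects freed by those finalizers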

Performance

MassiveSort is designed to be fast. How fast?

22GB of files can be merged and sorted in 5.5 minutes on high-end hardware, or half an hour on a more modest machine. More benchmarks are below.

[Figure: annotated Process Explorer graph of a large merge run - Procexp graph - annotated small.png]

Yes, that really is 26GB of allocated memory and 2GB of IO (probably mostly from disk cache)! You can't see it from the graph, but it's a 16-core machine as well.

MassiveSort runs in parallel on as many CPUs as you have. Almost all operations are run in parallel, and high CPU usage across many cores is observed (see the graph above).

How To Make It Go Faster

In terms of performance, the most important thing is IO speed. Having your input and output locations on separate disks will improve things. Using a temporary location on an SSD will improve things (in fact, running everything off an SSD works wonders). Using disk arrays over single disks helps too.

More CPUs do not help unless you can keep them fed with data from your disks. Seriously, adding more and faster disks (or better yet, SSDs) is the best way to improve performance.

If you have enough RAM, increasing the --max-sort-size will improve performance as it means less splitting needs to be done. You can also tweak the various file buffers, although they yield much smaller gains.

Benchmarks

For serious performance testing, a set of word lists totalling about 22GB is used (the large merge test set). Benchmarks for the large merge test set:

  • Azure D14 VM (Xeon E5-2660 2.2GHz, 2 sockets, 16 cores, 112GB RAM, all working on local SSD, --max-sort-size 256MB)

    • Total run time: 5.40 min
    • Split time: 170.47 sec
    • Sort time: 153.06 sec
  • Home Server (Phenom II X4 925 2.80GHz, 4 cores, 12GB RAM, reading from USB3 disk, temp and target on 4 disk SATA mirrored Storage Spaces array, --max-sort-size 128MB)

    • Total run time: 29.51 min
    • Split time: 21.21 min
    • Sort time: 8.21 min
  • Acer Aspire 4820TG laptop (Core i5 M460 2.53GHz, 2 cores, 6GB RAM, reading and writing to USB2 external disk, temp files on local SSD, --max-sort-size 64MB)

    • Total run time: 43.66 min
    • Split time: 26.06 min
    • Sort time: 17.64 min

Full console logs and stats for each benchmark are available in the source code.

Example Usage

1. Basic Usage

Sort a single file.

MassiveSort merge -i c:\data\file.txt -o c:\data\file.sorted.txt

2. Merge Files

Sort and merge several files. You simply keep adding files you'd like to sort to the -i option.

MassiveSort merge -i c:\data\file.txt c:\data\another_file.txt -o c:\data\two_files.sorted.txt

3. Merge Whole Folders

This recursively loads all files from a folder.

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt

4. Use A Temp Folder

By default, the %TEMP% environment variable is used to store temporary data when sorting. You may choose to use a different folder if you have another physical disk available (for performance reasons) or simply because %TEMP% does not have enough space available.

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt -t x:\mergeTemp

5. Increase Sort Memory

Up to 64MB of files are sorted in one go, by default. Increasing the amount of data sorted in one go will improve performance if you have enough physical RAM available. Note that this option refers to the maximum size of files sorted in one chunk; about 4x this amount of RAM is actually required for the sort process.

You can use size suffixes such as KB, MB or GB. The allowed range is 1MB - 1GB.

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt --max-sort-size 128MB

6. Decrease Sort Memory

Up to 64MB of files are sorted in one go, by default. If you are running short of physical RAM, you can decrease the amount of data sorted in one go, at the cost of performance. Note that this option refers to the maximum size of files sorted in one chunk; about 4x this amount of RAM is actually required for the sort process.

You can use size suffixes such as KB, MB or GB. The allowed range is 1MB - 1GB.

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt --max-sort-size 32MB

7. Set Workers During Split Phase

By default, there will be the same number of workers as you have physical CPUs (hyper-threaded / virtual CPUs are not counted). Split performance is dominated by IO, so if you have slow disks you may reduce this. Alternatively, if your CPU is maxed out during the split phase (and you have crazy fast SSDs), you may wish to increase this.

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt --split-workers 2

8. Set Workers During Sort Phase

By default, there will be the same number of workers as you have physical CPUs (hyper-threaded / virtual CPUs are not counted). The main reason to reduce the number of sort workers is to reduce physical memory use. You may get some benefit from increasing this to use hyper-threaded / virtual CPUs, but it is unlikely to be substantial.

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt --sort-workers 2

9. List Duplicates

A list of duplicate lines will be created in a file ending in .duplicates, alongside your final output file: c:\data\all_files.sorted.txt.duplicates in this example. Note that duplicates are only listed once (that is, the duplicates file is itself de-duplicated).

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt --save-duplicates

10. Convert to $HEX[]

Lines with bytes outside the usual ASCII printable range (less than 0x20 or greater than 0x7e) can be converted to a hex format, in line with what many other utilities can process. This requires re-writing some lines, so it incurs a small performance hit. Lines with only printable bytes are unchanged.

Note that MassiveSort itself does not depend on any special characters other than line feed and carriage return (0x0a and 0x0d). It does not attempt to interpret lines as any particular string or encoding; it simply deals with them as sequences of bytes.

Waffle's original request in hashcat was used as the basis of the implementation: https://hashcat.net/trac/ticket/148

MassiveSort merge -i c:\data -o c:\data\all_files.sorted.txt --convert-to-dollar-hex
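
A minimal sketch of the conversion described above (not the actual MassiveSort implementation): a line containing any byte outside 0x20 - 0x7e is rewritten as $HEX[...] with the lowercase hex of all its bytes, while purely printable lines pass through untouched.

using System.Linq;
using System.Text;

static class DollarHexSketch
{
    public static byte[] ConvertIfNeeded(byte[] line)
    {
        bool allPrintable = line.All(b => b >= 0x20 && b <= 0x7e);
        if (allPrintable) return line;   // printable lines are left unchanged

        string hex = string.Concat(line.Select(b => b.ToString("x2")));
        return Encoding.ASCII.GetBytes($"$HEX[{hex}]");
    }
}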

Command Line Reference

All command line options are listed below.

Required Arguments

  • -i --input - One or more files or folders to sort
  • -o --output - A file to write the output to

General Options

  • -t --temp-folder - Folder to use for writing temporary files
    • Default: %TEMP%\MassiveSort\<PID>
    • Putting the temp files on a different disk to your final output can significantly improve IO performance. SSDs are fantastic.
  • --leave-duplicates - Leave duplicates in the output file
    • Default: remove duplicates
  • --save-duplicates - Save duplicates to a separate .duplicates file.
    • Default: do not save duplicates
    • This has a small performance impact
  • --convert-to-dollar-hex - Converts non-ascii bytes to $HEX[] format
    • Default: make no changes to non-ascii bytes
    • This has a small performance impact
  • --save-stats - Saves more detailed stats to .stats file.
  • --whitespace - What changes are made to whitespace when processing. WARNING: this can make destructive changes to your inputs
    • NoChange: no changes to whitespace (default)
    • Trim: removes leading and trailing whitespace
    • Strip: removes all whitespace
  • --whitespace-chars - Byte(s) considered whitespace
    • Default: 0x09, 0x0b, 0x20

Sorting Options

  • -s --sort-by - Order to sort lines in
    • Dictionary: natural dictionary order (default) - eg: a, aa, ab, abc, b, bb
    • Length: length first, then dictionary order - eg: a, b, aa, ab, bb, abc (see the comparer sketch after this list)
  • --max-sort-size - Largest chunk of files to sort as a group
    • Default: 64MB
    • This is the major contributor to memory usage; decrease this if you are running short of physical RAM
  • --sort-algorithm - Sort algorithm to use, options:
    • DefaultArray: Array.Sort()
    • LinqOrderBy: Enumerable.OrderBy()
    • TimSort: Timsort algorithm (default)
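
For reference, the Length ordering can be illustrated with a comparer like this (a sketch of the ordering only, not MassiveSort's internal comparer):

using System.Collections.Generic;

sealed class LengthThenDictionaryComparer : IComparer<string>
{
    public int Compare(string x, string y)
    {
        int byLength = x.Length.CompareTo(y.Length);   // shorter lines first
        return byLength != 0
            ? byLength
            : string.CompareOrdinal(x, y);             // then dictionary order
    }
}

// new[] { "ab", "a", "abc", "bb", "b", "aa" } sorted with this comparer
// gives: a, b, aa, ab, bb, abc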

Worker / Thread Options

  • --split-workers - Number of worker threads used when splitting files
    • Default: number of physical cores in your PC
  • --sort-workers - Number of worker threads used when sorting files
    • Major contributor to memory usage; decrease this if you are running short of physical RAM
    • Default: number of physical cores in your PC
  • --io-workers - Number of worker threads for IO intensive operations (eg: creating / flushing files)
    • Default: 8 workers
    • This has minimal memory impact, SSD users may wish to increase this to 32

Large File Options

  • --large-threshold - Files greater than this are considered large.
    • One or two very large files can cause a bottleneck in the splitting phase. Any file larger than this value will be processed in parts in parallel.
    • Default: 1GB
  • --large-chunk - Large files are processed in chunks this big.
    • Default: 256MB

Buffers and Memory Options

  • --line-buffer-size - Buffer size for reading lines, this is also the maximum line length
    • Default: 64KB
  • --read-file-buffer-size - Buffer size for reading input files
    • Default: 64KB
  • --temp-file-buffer-size - Buffer size for writing temp files
    • Default: 128KB
  • --output-file-buffer-size - Buffer size for writing final output file
    • Default: 256KB
  • --max-outstanding-sorted-chunks - Number of chunks to buffer in memory when writing
    • Default: 2
    • Major contributor to memory usage if your CPU sorts faster than your disk can write
  • --aggressive-memory-collection - Does a full garbage collection after each file is processed
    • This does not reduce the amount of memory required (make buffer sizes or number of workers smaller for that). It helps keep excess memory usage minimal.