Enable zstd compression format #640

Open
GilbertHan1011 wants to merge 5 commits into OpenGene:master from GilbertHan1011:master
Conversation

@GilbertHan1011

Zstandard (zstd) is a compression algorithm with higher performance than zlib. This contribution enables fastp to handle the zstd format.
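
Handling a new compressed format usually starts with recognizing it on input. The following is a minimal sketch, assuming detection by leading magic bytes rather than by file extension. `InputFormat` and `detectFormat` are hypothetical names used for illustration; they are not taken from this PR or from fastp, which may detect formats differently. The magic values themselves are standard: a Zstandard frame begins with the bytes `28 B5 2F FD`, and a gzip stream begins with `1F 8B`.

```cpp
#include <cstddef>

// Hypothetical enum for illustration only (not from the PR).
enum class InputFormat { Plain, Gzip, Zstd };

// Hypothetical helper: classify a buffer by its leading magic bytes.
// Zstandard skippable frames are not handled in this sketch.
inline InputFormat detectFormat(const unsigned char* buf, std::size_t len) {
    // Zstandard frame magic number 0xFD2FB528, stored little-endian.
    if (len >= 4 && buf[0] == 0x28 && buf[1] == 0xB5 &&
        buf[2] == 0x2F && buf[3] == 0xFD) {
        return InputFormat::Zstd;
    }
    // gzip member header: ID1 = 0x1F, ID2 = 0x8B.
    if (len >= 2 && buf[0] == 0x1F && buf[1] == 0x8B) {
        return InputFormat::Gzip;
    }
    // Anything else is treated as uncompressed text (e.g. plain FASTQ).
    return InputFormat::Plain;
}
```

A caller would read the first few bytes of the input file, pass them to `detectFormat`, and then choose the matching decompressor.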

Contributor

@bwlang bwlang left a comment

@sfchen I tested this locally. It compiled and worked as expected for reading zstd-compressed inputs, including larger files (10k reads). I also tested interleaved fastq.zst files successfully.

I think it could be merged, maybe without the .gitignore modification.

.gitignore Outdated
*.html No newline at end of file
*.html

out/ No newline at end of file
Contributor

I think this .gitignore change is a local preference.

Author

Yes. This has been updated.

@KimBioInfoStudio
Member

We are working on this. By the way, which downstream software supports consuming fq.zst?

@GilbertHan1011
Author

@KimBioInfoStudio
Member

It seems most upstream basecalling software does not support outputting r1.fq.zst, or even BGZF. PRs are welcome; could you rebase this onto the latest master branch? By the way, we are looking for a better format that supports indexed multi-member files and parallel reads.

@GilbertHan1011
Author

> It seems most upstream basecalling software does not support outputting r1.fq.zst, or even BGZF. PRs are welcome; could you rebase this onto the latest master branch? By the way, we are looking for a better format that supports indexed multi-member files and parallel reads.

Done.
I'm also very interested in that kind of format. I'm planning to work on it.

Member

@KimBioInfoStudio KimBioInfoStudio left a comment

Could you please use Git LFS to track those files?

@GilbertHan1011
Author

I tried Git LFS, but GitHub does not allow uploading new LFS objects to my public fork (rejected by server).
So I removed the .zst testdata files from the PR to avoid adding binary blobs.

@KimBioInfoStudio
Member

KimBioInfoStudio commented Mar 12, 2026

Could you help with one set of benchmarks as evidence?

  1. xxx.fq.zst wins on compression ratio
  2. no performance regression compared to .fq.gz -> .fq.gz
  3. covers both PE and SE
  4. at least 10M reads/pairs
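
The criteria above can be sketched as a simple check over measured results. This is only an illustration under stated assumptions: `BenchRun`, `sizeRatio`, and `passesCriteria` are hypothetical names, the 5% timing tolerance is an arbitrary placeholder rather than a threshold anyone in this thread specified, and the check would be run once for SE data and once for PE data.

```cpp
#include <cstdint>

// Hypothetical record of one benchmark run (not from the PR).
struct BenchRun {
    std::uint64_t inputReads;   // reads (SE) or read pairs (PE) processed
    std::uint64_t outputBytes;  // size of the compressed output on disk
    double wallSeconds;         // wall-clock time of the fastp run
};

// Output size of the zstd run relative to the gzip run; below 1.0 means zstd is smaller.
inline double sizeRatio(const BenchRun& zst, const BenchRun& gz) {
    return static_cast<double>(zst.outputBytes) /
           static_cast<double>(gz.outputBytes);
}

// Assumed interpretation of criteria 1, 2, and 4. The tolerance is a placeholder.
inline bool passesCriteria(const BenchRun& zst, const BenchRun& gz,
                           double tolerance = 0.05) {
    const std::uint64_t minReads = 10000000ULL;  // at least 10M reads/pairs
    const bool bigEnough = zst.inputReads >= minReads && gz.inputReads >= minReads;
    const bool smallerOutput = zst.outputBytes < gz.outputBytes;
    const bool noRegression = zst.wallSeconds <= gz.wallSeconds * (1.0 + tolerance);
    return bigEnough && smallerOutput && noRegression;
}
```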

@GilbertHan1011
Author

fastp_zstd_benchmark_report_maxcomp.pdf
@KimBioInfoStudio

@KimBioInfoStudio
Member

@sfchen Please kindly consider merging this PR.

3 participants