Benchmark encoding against ffmpeg cli #1074
base: main
Conversation
```python
def encode_torchcodec(frames, output_path, device="cpu"):
    encoder = VideoEncoder(frames=frames, frame_rate=30)
    if device == "cuda":
        encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 1})
```
Are we currently using qp=1 for the torchcodec encoder vs qp=0 for the ffmpeg CLI? (line 155)
Yes, and we should not be, thanks for catching this!
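In case it helps, a minimal sketch of how both call sites could share the same QP value so the comparison stays apples-to-apples; the `QP` constant and the `ffmpeg_nvenc_args` helper are hypothetical (not the code in this PR), and `VideoEncoder` is assumed to be imported as in the benchmark script:

```python
# Hypothetical sketch: one constant-QP value used by both encoders.
QP = 0

def encode_torchcodec(frames, output_path, device="cpu"):
    encoder = VideoEncoder(frames=frames, frame_rate=30)
    if device == "cuda":
        encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": QP})

def ffmpeg_nvenc_args(raw_frames_path, output_path):
    # Assumed FFmpeg CLI invocation; the real benchmark's flags may differ.
    return ["ffmpeg", "-y", "-i", str(raw_frames_path),
            "-c:v", "h264_nvenc", "-qp", str(QP), str(output_path)]
```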
Great work on the benchmarks @Dan-Flores! I liked the detailed analysis of the results. I left two clarifying questions.
NicolasHug left a comment
Thanks @Dan-Flores, this looks good!
```python
self.metrics = {
    "utilization": [s["utilization"] for s in samples],
    "memory_used": [s["memory_used"] for s in samples],
}
```
On NVENCMonitor above, I think we might want to use pynvml instead, as done e.g. in P1984513849.
The main reason is that NVENCMonitor is sampling the utilization value every 50ms, which isn't exactly in sync with the number of iterations in the loop. That is, the returned nvenc_tensor doesn't contain the same number of values as the times tensor, and so their reported values aren't averaged over the same number of experiments either.
Thanks for the example, I'll update to use pynvml.
I see how arbitrarily selecting 50ms will not produce the same number of values as the times tensor, but I don't completely understand how pynvml.nvmlDeviceGetDecoderUtilization manages it. It seems like it is always sampling the device for usage, and when called it returns a single median/max/average over an automatically determined sampling period?
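A minimal sketch of what the pynvml query could look like; the device index and the use of `nvmlDeviceGetEncoderUtilization` (rather than the decoder variant) are assumptions for this encoding benchmark:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumed: GPU 0

# NVML averages each reading over a driver-chosen sampling period; the second
# return value reports that period in microseconds.
utilization, sampling_period_us = pynvml.nvmlDeviceGetEncoderUtilization(handle)
memory_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used

pynvml.nvmlShutdown()
```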
NicolasHug left a comment
Thanks @Dan-Flores!
```python
# By default, frames will be written inside the benchmark function
if args.skip_write_frames:
    write_raw_frames(frames, str(raw_frames_path))
```
Lol this makes sense but it's a bit surprising to read "if skip write frames, then write frames". Here's a suggestion below.
Suggested change:

```python
# If skip_write_frames is True, then we don't benchmark the time it takes to write the frames.
# But we still need to write them for FFmpeg to find them!
if args.skip_write_frames:
    write_raw_frames(frames, str(raw_frames_path))
```
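For context, a rough sketch of what `write_raw_frames` could be doing; its real implementation isn't shown in this thread, so the (N, C, H, W) uint8 layout below is an assumption:

```python
import torch

def write_raw_frames(frames: torch.Tensor, raw_frames_path: str) -> None:
    # Assumed input: uint8 frames of shape (N, C, H, W). FFmpeg's rawvideo
    # demuxer expects packed per-frame pixels, so permute to (N, H, W, C)
    # before dumping bytes (readable with "-f rawvideo -pix_fmt rgb24 -s WxH").
    packed = frames.permute(0, 2, 3, 1).contiguous().cpu()
    with open(raw_frames_path, "wb") as f:
        f.write(packed.numpy().tobytes())
```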
```python
return times_tensor, {
    "utilization": torch.tensor(utilizations).float() if gpu_monitoring else None,
    "memory_used": torch.tensor(memory_usage).float() if gpu_monitoring else None,
}
```
Just replying to #1074 (comment) here for visibility:
> It seems like it is always sampling the device for usage, and when called it returns a single median/max/average over an automatically determined sampling period?
Yes, this is also my understanding. And I think this is also what was happening with your previous implementation with nvidia-smi!
I think there are two main variables:
- The "query frequency", i.e. the frequency at which we call nvml/nvidia-smi. In your previous implementation, this was every 50ms. In the current implementation, it's every time we enter the `for _ in range(average_over)` loop.
- The "sampling period" over which nvml/nvidia-smi average and report their results. It's a different variable from the query frequency! And, as far as I can tell, we do not have control over this one. I can't find a relevant parameter for it, and claude says it's really determined by the underlying driver (which I'm inclined to believe).
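To make that concrete, here is a sketch of querying NVML once per timed iteration so the utilization tensor has exactly as many entries as the times tensor; `encode_fn` and the helper structure are hypothetical:

```python
import time

import pynvml
import torch

def benchmark(encode_fn, average_over=10):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumed: GPU 0

    times, utilizations, memory_usage = [], [], []
    for _ in range(average_over):
        start = time.perf_counter()
        encode_fn()
        times.append(time.perf_counter() - start)

        # One query per iteration (the "query frequency"); NVML still averages
        # each reading over its own driver-chosen sampling period.
        util, _period_us = pynvml.nvmlDeviceGetEncoderUtilization(handle)
        utilizations.append(util)
        memory_usage.append(pynvml.nvmlDeviceGetMemoryInfo(handle).used)

    pynvml.nvmlShutdown()
    return (torch.tensor(times),
            torch.tensor(utilizations).float(),
            torch.tensor(memory_usage).float())
```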
This PR creates a benchmark to compare `VideoEncoder` against the FFmpeg CLI. These tools aren't one-to-one, so some assumptions are made:
- For `VideoEncoder`, we use a simple encode-to-file workflow (a sketch is shown after the result summary below).
- For the FFmpeg CLI, we count the time used to write frames from a tensor to a file if the `--write_frames` flag is used.

Result Summary:
- `VideoEncoder` shows better performance on GPU + CPU.
- `VideoEncoder` shows a significant speed improvement, up to 3.5x faster than FFmpeg CLI for encoding 30 frames, without adding the time required to write frames to bytes.
- … `VideoEncoder`, while median GPU memory used values are the same.
- … `VideoEncoder` is significantly faster.
- … `VideoEncoder::encode` to close the gap, but let's land this benchmark as is.
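For reference, a minimal sketch of the kind of `VideoEncoder` workflow being benchmarked, with the constructor arguments taken from the diff above; the import path, input tensor, and output path are illustrative assumptions:

```python
import torch
from torchcodec.encoders import VideoEncoder  # import path assumed

# Illustrative input: 30 uint8 RGB frames at 1280x720.
frames = torch.randint(0, 256, (30, 3, 720, 1280), dtype=torch.uint8)

encoder = VideoEncoder(frames=frames, frame_rate=30)
encoder.to_file(dest="output.mp4")  # on CUDA the diff passes codec="h264_nvenc"
```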
Details

All benchmarks are run using a 1280x720 video. Command to generate the video:

```
ffmpeg -f lavfi -i testsrc2=duration=600:size=1280x720:rate=30 -c:v libx264 -pix_fmt yuv420p test/resources/testsrc2_10min.mp4
```

Benchmarking `nasa_13013.mp4`, writing frames in FFmpeg:

```
$ python benchmarks/encoders/benchmark_encoders.py
```

Benchmarking `nasa_13013.mp4`, with `--skip-write-frames`:

```
$ python benchmarks/encoders/benchmark_encoders.py --skip-write-frames
```