
Conversation

@Dan-Flores Dan-Flores commented Nov 26, 2025

This PR creates a benchmark to compare VideoEncoder against FFmpeg CLI. These tools aren't one-to-one, so some assumptions are made:

For VideoEncoder, we use this simple workflow:

encoder = VideoEncoder(frames=frames, frame_rate=30)
encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 1})

For FFmpeg CLI, we also count the time needed to write the frames from a tensor to a raw file when write_frames is set (this is the default; passing --skip-write-frames excludes it):

if write_frames:
    raw_frames = frames.permute(0, 2, 3, 1).contiguous()[:num_frames]
    with open(raw_path, "wb") as f:
        f.write(raw_frames.cpu().numpy().tobytes())

ffmpeg_cmd = [...]
subprocess.run(ffmpeg_cmd, check=True, capture_output=True)
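
The actual ffmpeg_cmd is elided above. As a rough illustration only, a GPU command taking the raw RGB24 frames could be shaped like the sketch below; the resolution, frame rate, and qp value are assumptions for this example, not the exact flags used in the PR.

# Hypothetical sketch of ffmpeg_cmd for a GPU run (assumed flags, not the PR's exact command)
ffmpeg_cmd = [
    "ffmpeg", "-y",
    "-f", "rawvideo", "-pix_fmt", "rgb24",   # raw frames written above
    "-s", "480x270", "-r", "30",             # assumed width x height and frame rate
    "-i", str(raw_path),
    "-c:v", "h264_nvenc", "-qp", "0",        # NVENC encoder; the qp choice is discussed in the review below
    str(output_path),
]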

Result Summary:

  • VideoEncoder shows better performance on both GPU and CPU.
    • When the time required to write frames to bytes is added, FFmpeg CLI is much slower.
  • On GPU, VideoEncoder shows a significant speed improvement: it is up to 3.5x faster than FFmpeg CLI over the 30-run benchmark, even without adding the time required to write frames to bytes (see the quick check after the results below).
    • NVENC utilization is higher for VideoEncoder, while median GPU memory used is the same.
  • On CPU, FFmpeg CLI has a slight edge when the time required to write frames to bytes is not counted. Otherwise, VideoEncoder is significantly faster.
    • I suspect there are optimizations we could make in VideoEncoder::encode to close the gap, but let's land this benchmark as is.
Details

All benchmarks are run using a 1280x720 video. Command to generate the video:

`ffmpeg -f lavfi -i testsrc2=duration=600:size=1280x720:rate=30 -c:v libx264 -pix_fmt yuv420p test/resources/testsrc2_10min.mp4`

Benchmarking nasa_13013.mp4, writing frames in FFmpeg

$ python benchmarks/encoders/benchmark_encoders.py

Benchmarking 390 frames from nasa_13013.mp4 over 30 runs:
Decoded 390 frames of size 270x480

VideoEncoder on GPU   med = 119.26 ms, max = 122.06 ms, fps = 3270.1
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 30.0%,     max = 38.0%

FFmpeg CLI on GPU     med = 1174.55 ms, max = 1524.59 ms, fps = 332.0
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 15.0%,     max = 22.0%

VideoEncoder on CPU   med = 408.43 ms, max = 454.66 ms, fps = 954.9

FFmpeg CLI on CPU     med = 1184.47 ms, max = 1219.28 ms, fps = 329.3

Benchmarking nasa_13013.mp4, with --skip-write-frames

$ python benchmarks/encoders/benchmark_encoders.py --skip-write-frames

Benchmarking 390 frames from nasa_13013.mp4 over 30 runs:
Decoded 390 frames of size 270x480

VideoEncoder on GPU   med = 120.21 ms, max = 122.40 ms, fps = 3244.4
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 26.0%,     max = 39.0%

FFmpeg CLI on GPU     med = 419.66 ms, max = 1189.17 ms, fps = 929.3
GPU memory used:      med = 1231.0 MB, max = 1231.0 MB
NVENC utilization:    med = 18.0%,     max = 23.0%

VideoEncoder on CPU   med = 408.86 ms, max = 449.01 ms, fps = 953.9

FFmpeg CLI on CPU     med = 383.65 ms, max = 410.91 ms, fps = 1016.5
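
As a quick sanity check on the numbers above (not part of the benchmark script): the reported fps appears to be the frame count divided by the median time, and the ~3.5x GPU speedup quoted in the summary comes from the --skip-write-frames medians.

num_frames = 390
fps_gpu = num_frames / (120.21 / 1000)   # ~3244 fps, matches "VideoEncoder on GPU" above
speedup = 419.66 / 120.21                # ~3.5x, FFmpeg CLI vs VideoEncoder median on GPU
print(f"{fps_gpu:.1f} fps, {speedup:.1f}x speedup")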

@meta-cla meta-cla bot added the CLA Signed label Nov 26, 2025
@Dan-Flores Dan-Flores force-pushed the test_gpu_benchmarking branch from e4b6d52 to 743b664 on December 2, 2025 at 14:23
@Dan-Flores Dan-Flores changed the title from "[wip] benchmark encoding" to "Benchmark encoding against ffmpeg cli" on Dec 18, 2025
@Dan-Flores Dan-Flores marked this pull request as ready for review December 18, 2025 14:46
def encode_torchcodec(frames, output_path, device="cpu"):
    encoder = VideoEncoder(frames=frames, frame_rate=30)
    if device == "cuda":
        encoder.to_file(dest=output_path, codec="h264_nvenc", extra_options={"qp": 1})
Contributor

Are we currently using qp=1 for torchcodec encoder vs qp=0 for ffmpeg cli? (line 155)

Contributor Author

Yes, and we should not be, thanks for catching this!

mollyxu commented Dec 18, 2025

Great work on the benchmarks @Dan-Flores! I liked the detailed analysis of the results. I left two clarifying questions.

@NicolasHug NicolasHug left a comment

Thanks @Dan-Flores, this looks good!

self.metrics = {
    "utilization": [s["utilization"] for s in samples],
    "memory_used": [s["memory_used"] for s in samples],
}
Contributor

On NVENCMonitor above, I think we might want to use pynvml instead, as done e.g. in P1984513849.

The main reason is that NVENCMonitor samples the utilization value every 50ms, which isn't exactly in sync with the number of iterations in the loop. That is, the returned nvenc_tensor doesn't contain the same number of values as the times tensor, so their reported values aren't averaged over the same number of experiments either.
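
A minimal sketch of what the pynvml-based approach could look like (assuming the pynvml package; this is illustrative and not the code from P1984513849). Querying once per benchmark iteration keeps the number of samples equal to the number of entries in the times tensor:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

average_over = 30  # number of runs, as in the benchmark
samples = []
for _ in range(average_over):
    # ... run one encode iteration and record its time here ...
    util, _period_us = pynvml.nvmlDeviceGetEncoderUtilization(handle)  # NVENC utilization in %
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    samples.append({"utilization": util, "memory_used": mem.used / 2**20})  # MB

pynvml.nvmlShutdown()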

Contributor Author

Thanks for the example, I'll update to use pynvml.

I see how arbitrarily selecting 50ms will not produce the same number of values as the times tensor, but I don't completely understand how pynvml.nvmlDeviceGetDecoderUtilization manages it. It seems like it is always sampling the device for usage, and when called it returns a single median/max/average over an automatically determined sampling period?

@NicolasHug NicolasHug left a comment

Thanks @Dan-Flores!

Comment on lines 164 to 166
# By default, frames will be written inside the benchmark function
if args.skip_write_frames:
    write_raw_frames(frames, str(raw_frames_path))
Contributor

Lol this makes sense but it's a bit surprising to read "if skip write frames, then write frames". Here's a suggestion below.

Suggested change
# By default, frames will be written inside the benchmark function
if args.skip_write_frames:
    write_raw_frames(frames, str(raw_frames_path))
# If skip_write_frames is True, then we don't benchmark the time it takes to write the frames.
# But we still need to write them for FFmpeg to find them!
if args.skip_write_frames:
    write_raw_frames(frames, str(raw_frames_path))

return times_tensor, {
    "utilization": torch.tensor(utilizations).float() if gpu_monitoring else None,
    "memory_used": torch.tensor(memory_usage).float() if gpu_monitoring else None,
}
Contributor

Just replying to #1074 (comment) here for visibility:

It seems like it is always sampling the device for usage, and when called it returns a single median/max/average over an automatically determined sampling period?

Yes, this is also my understanding. And I think this is also what was happening with your previous implementation with nvidia-smi!

I think there are two main variables:

  • The "query frequency", i.e. the frequency at which we call nvml/nvidia-smi. In your previous implementation, this was every 50ms. In the current implementation, it's every time we enter the for _ in range(average_over) loop.
  • The "sampling period" over which nvml/nvidia-smi average and report their results. It's a different variable from the query frequency! And, as far as I can tell, we do not have control over this one. I can't find a relevant parameter for it, and claude says it's really determined by the underlying driver (which I'm inclined to believe)
