feat: auto-extract multi-language DVB subtitles into per-language files (#447) #2243

Open
ujjwalr27 wants to merge 2 commits into CCExtractor:master from ujjwalr27:feature/multi-dvb-subtitle-extraction

@ujjwalr27
Contributor

[FEATURE] Auto-extract multi-language DVB subtitles into per-language files

Closes #447

In raising this pull request, I confirm the following (please check boxes):

Reason for this PR:

  • This PR adds new functionality.
  • This PR fixes a bug that I have personally experienced or that a real user has reported and for which a sample exists.
  • This PR is porting code from C to Rust.

Sanity check:

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • If the PR adds new functionality, I've added it to the changelog. If it's just a bug fix, I have NOT added it to the changelog.
  • I am NOT adding new C code unless it's to fix an existing, reproducible bug.

⚠️ This PR adds new C code for a feature requested in #447 by a real user, with a provided sample file.


Description

Implements #447 — when a DVB/TS recording contains multiple DVB subtitle streams, CCExtractor now automatically detects each stream and writes subtitles to separate files named by ISO-639 language code. No manual configuration or pre-inspection of the file is required.

Before:

ccextractor arte_multiaudio.ts
# → arte_multiaudio.srt   (only first/primary stream extracted)
# French DVB subtitle stream silently ignored

After:

ccextractor arte_multiaudio.ts
# → arte_multiaudio.srt        (teletext / primary stream)
# → arte_multiaudio_fra.srt    (French DVB subtitles, auto-detected)

No new CLI flags. Fully automatic. Single-stream recordings are unaffected.


Repro Instructions

Test 1 — arte_multiaudio.ts (from issue #447)

Download: https://www.dropbox.com/s/5oaqnjgqq1cqzky/arte_multiaudio.ts?dl=0

The file contains:

PID         | Type         | Language
0x103 (259) | DVB Teletext | deu
0x104 (260) | DVB Subtitle | deu (no bitmap packets in this recording)
0x106 (262) | DVB Subtitle | fra

Before this PR (on master):

./ccextractor arte_multiaudio.ts
# Only produces arte_multiaudio.srt (teletext)
# French DVB subtitle stream is silently ignored

After this PR:

./ccextractor arte_multiaudio.ts
DVB subtitle PID 260 language: deu
DVB subtitle PID 262 language: fra
...
-rw-r--r-- 4106 arte_multiaudio.srt       <- Teletext subtitles
-rw-r--r-- 3924 arte_multiaudio_fra.srt   <- DVB bitmap subtitles (fra, newly extracted)
Exit code: 0

Also verified with --codec dvbsub:

./ccextractor arte_multiaudio.ts --codec dvbsub
# → arte_multiaudio_fra.srt  (3924 bytes)
# Exit code: 0

Test 2 — DVB-only file with two subtitle streams (deu + fra)

A recording with no teletext, only two DVB subtitle PIDs:

PID     | Type         | Language
index 2 | DVB Subtitle | deu
index 3 | DVB Subtitle | fra

./ccextractor test_two_dvb.ts
DVB subtitle PID ... language: deu
DVB subtitle PID ... language: fra
...
-rw-r--r-- 3924 test_two_dvb_deu.srt   <- German-tagged DVB subtitles
-rw-r--r-- 3924 test_two_dvb_fra.srt   <- French-tagged DVB subtitles
Exit code: 0

Both files are produced automatically in a single pass, with no flags or prior knowledge of how many subtitle streams exist.


Implementation

Files changed

File                                 | Change
src/lib_ccx/ccx_demuxer.h            | Add char lang[4] to cap_info struct
src/lib_ccx/ts_tables.c              | Parse ISO-639 code from DVB subtitle descriptor in PMT
src/lib_ccx/ts_info.c                | Propagate lang in update_capinfo(); protect DVB streams from ignore_other_stream()
src/lib_ccx/lib_ccx.c                | Per-PID encoder/decoder routing; fix two segfaults in cleanup
src/lib_ccx/general_loop.c           | Secondary loop to process all non-primary DVB subtitle PIDs
src/rust/src/demuxer/common_types.rs | Add lang: [i8; 4] to CapInfo
src/rust/src/ctorust.rs              | Propagate lang in FromCType<cap_info>
src/rust/src/common.rs               | Propagate lang in CType<cap_info>
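
For readers unfamiliar with the PMT side of this change, here is a minimal sketch of pulling the ISO-639 code out of a DVB subtitling descriptor (tag 0x59, where each entry is 3 bytes of language code, 1 byte of subtitling type and two 16-bit page IDs). The function name parse_dvb_sub_lang and its signature are assumptions made for illustration; this is not the code in ts_tables.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DVB_SUBTITLING_DESCRIPTOR_TAG 0x59

/* Hypothetical helper, not the PR's code.
 * desc points at the descriptor_tag byte; len is the number of bytes available.
 * Copies the language code of the first entry into lang (NUL-terminated).
 * Returns 1 on success, 0 if the buffer is not a usable subtitling descriptor. */
int parse_dvb_sub_lang(const uint8_t *desc, size_t len, char lang[4])
{
    assert(desc != NULL && lang != NULL);
    lang[0] = '\0';

    if (len < 2 || desc[0] != DVB_SUBTITLING_DESCRIPTOR_TAG)
        return 0;

    size_t body_len = desc[1]; /* descriptor_length */
    if (body_len < 8 || len < 2 + body_len)
        return 0; /* need at least one complete 8-byte entry */

    /* Entry layout: language (3), subtitling_type (1),
     * composition_page_id (2), ancillary_page_id (2). */
    memcpy(lang, desc + 2, 3);
    lang[3] = '\0';
    return 1;
}
```

The 4-byte buffer matches the char lang[4] field listed above: three code characters plus the terminating NUL.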

Key design decisions

Per-PID decoders in single-program mode
Each DVB subtitle PID has its own DVBSubContext with different composition_id/ancillary_id from the PMT. The existing single-decoder model was extended to always create a fresh decoder per DVB PID.
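
As a rough illustration of the per-PID idea (not the actual lib_ccx.c routing), the sketch below keeps one context per PID and creates it on first use. The toy_sub_ctx and toy_ctx_table types, the fixed table size and the function names are all hypothetical stand-ins for the real decoder structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAX_DVB_PIDS 8

/* Hypothetical stand-in for a per-stream DVB subtitle context. */
struct toy_sub_ctx {
    int pid;
    uint16_t composition_id;
    uint16_t ancillary_id;
};

struct toy_ctx_table {
    struct toy_sub_ctx *slots[TOY_MAX_DVB_PIDS];
    int count;
};

/* Return the context for pid, creating it on first use.
 * Returns NULL if the table is full or allocation fails. */
struct toy_sub_ctx *toy_get_or_create(struct toy_ctx_table *t, int pid,
                                      uint16_t composition_id,
                                      uint16_t ancillary_id)
{
    assert(t != NULL);
    for (int i = 0; i < t->count; i++)
        if (t->slots[i]->pid == pid)
            return t->slots[i]; /* existing context: IDs are left untouched */

    if (t->count == TOY_MAX_DVB_PIDS)
        return NULL;

    struct toy_sub_ctx *c = calloc(1, sizeof *c);
    if (c == NULL)
        return NULL;
    c->pid = pid;
    c->composition_id = composition_id;
    c->ancillary_id = ancillary_id;
    t->slots[t->count++] = c;
    return c;
}

/* Release every context exactly once and reset the table. */
void toy_free_table(struct toy_ctx_table *t)
{
    assert(t != NULL);
    for (int i = 0; i < t->count; i++) {
        free(t->slots[i]);
        t->slots[i] = NULL;
    }
    t->count = 0;
}
```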

Language-tagged output filenames
update_encoder_list_cinfo() uses cinfo->lang to suffix the output filename, matching existing behaviour for multi-program mode.
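
The naming scheme can be sketched as follows. build_output_name is a hypothetical helper written for this description, not the body of update_encoder_list_cinfo(); it only shows how a basename, a language code and an extension combine into names such as arte_multiaudio_fra.srt.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<base>_<lang><ext>" into out, or "<base><ext>"
 * when lang is NULL or empty. Returns 0 on success, -1 if out is too small. */
int build_output_name(char *out, size_t out_size, const char *base,
                      const char *lang, const char *ext)
{
    assert(out != NULL && base != NULL && ext != NULL);
    int n;

    if (lang != NULL && strlen(lang) > 0)
        n = snprintf(out, out_size, "%s_%s%s", base, lang, ext);
    else
        n = snprintf(out, out_size, "%s%s", base, ext);

    /* snprintf returns the length it wanted to write; reject truncation. */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Called with ("arte_multiaudio", "fra", ".srt") it yields arte_multiaudio_fra.srt; with an empty language it yields arte_multiaudio.srt.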

Separate encoder/decoder cleanup
dinit_libraries() previously matched encoders by program number inside the decoder loop. With multiple DVB encoders sharing the same program number, this caused a double free on exit. Fixed by splitting the cleanup into two independent passes, one for decoders and one for encoders.
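
A toy model of the two-pass shape, assuming simple pointer arrays in place of the real encoder and decoder lists (the toy_* names are hypothetical and this is not the dinit_libraries() code): no encoder is looked up while walking decoders, so each object is visited once even when several share a program number.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX 4

struct toy_decoder { int program_number; };
struct toy_encoder { int program_number; };

struct toy_ctx {
    struct toy_decoder *decoders[TOY_MAX];
    int n_dec;
    struct toy_encoder *encoders[TOY_MAX];
    int n_enc;
};

/* Releases everything owned by ctx and returns how many objects were freed. */
int toy_dinit(struct toy_ctx *ctx)
{
    assert(ctx != NULL);
    int released = 0;

    /* Pass 1: decoders only. Encoders are not touched here. */
    for (int i = 0; i < ctx->n_dec; i++) {
        if (ctx->decoders[i] != NULL)
            released++;
        free(ctx->decoders[i]);
        ctx->decoders[i] = NULL;
    }
    ctx->n_dec = 0;

    /* Pass 2: encoders only, each visited exactly once. */
    for (int i = 0; i < ctx->n_enc; i++) {
        if (ctx->encoders[i] != NULL)
            released++;
        free(ctx->encoders[i]);
        ctx->encoders[i] = NULL;
    }
    ctx->n_enc = 0;

    return released;
}
```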

dec_ctx->prev zero-initialization
dec_ctx->prev was allocated with malloc but never zeroed, so free_decoder_context() freed garbage pointers during cleanup. Fixed with memset(prev, 0, sizeof(...)).
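
The pattern in isolation, using a hypothetical toy_dec struct rather than the real decoder context: once the freshly allocated block is zeroed, pointer members that were never assigned are NULL on the platforms CCExtractor targets, and free(NULL) is a no-op.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a decoder context with owned pointers. */
struct toy_dec {
    void *xds_ctx;
    void *private_data;
};

/* Allocate and zero, so unassigned pointer members are safe to free later. */
struct toy_dec *toy_alloc_zeroed(void)
{
    struct toy_dec *d = malloc(sizeof *d);
    if (d == NULL)
        return NULL;
    memset(d, 0, sizeof *d);
    return d;
}

/* Attach an owned buffer of n bytes; replaces nothing in this toy model. */
void toy_attach_private(struct toy_dec *d, size_t n)
{
    assert(d != NULL && d->private_data == NULL);
    d->private_data = malloc(n);
}

/* Safe even if some members were never assigned, thanks to the memset. */
void toy_free(struct toy_dec *d)
{
    if (d == NULL)
        return;
    free(d->xds_ctx);
    free(d->private_data);
    free(d);
}
```

calloc(1, sizeof *d) would give the same result in one call; the sketch uses malloc plus memset only because that is the fix described here.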

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit d56a6be...:
Report Name | Tests Passed
Broken      | 9/13
CEA-708     | 1/14
DVB         | 2/7
DVD         | 3/3
DVR-MS      | 2/2
General     | 20/27
Hardsubx    | 1/1
Hauppage    | 3/3
MP4         | 3/3
NoCC        | 10/10
Options     | 76/86
Teletext    | 20/21
WTV         | 13/13
XDS         | 31/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 8e8229b88b...
  • ccextractor --autoprogram --out=ttxt --latin1 1020459a86...
  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9...
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc...
  • ccextractor --autoprogram --out=srt --latin1 b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla 7aad20907e...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --autoprogram --out=ttxt --latin1 01509e4d27...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --codec dvbsub --out=spupng 85271be4d2...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --autoprogram --out=ttxt --xds --latin1 --ucla 85058ad37e...
  • ccextractor --autoprogram --out=srt --latin1 --ucla b22260d065...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla --xds 7f41299cc7...

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit d56a6be...:
Report Name | Tests Passed
Broken      | 9/13
CEA-708     | 1/14
DVB         | 2/7
DVD         | 3/3
DVR-MS      | 2/2
General     | 22/27
Hardsubx    | 0/1
Hauppage    | 3/3
MP4         | 3/3
NoCC        | 10/10
Options     | 79/86
Teletext    | 20/21
WTV         | 13/13
XDS         | 31/34

Your PR breaks these cases:

NOTE: The following tests have been failing on the master branch as well as the PR:

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --out=spupng c83f765c66..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

Contributor

@cfsmp3 left a comment

Deep Review Results

First off — this is a really well-done PR. The description is excellent, the repro instructions are clear, and the code is clean. This is the quality level we want from all contributors.

What works well

  • Feature works correctly: The arte_multiaudio.ts sample now produces both arte.srt (teletext) and arte_fra.srt (French DVB) in a single pass with no flags needed.
  • No repeating subtitles: The old attempt at this feature (PRs #1912/#2048/#2051/#2058) had bugs where subtitles repeated or timestamps started at zero. None of those bugs are present here.
  • Content is byte-identical to master on all existing single-stream samples — the decoding logic is correct.
  • Cleanup fixes are good: The split encoder/decoder cleanup, the memset for dec_ctx->prev, and the transcript_settings deep-copy all fix real issues.
  • Output with -o flag works correctly.
  • Tested across 12+ samples (CEA-608, DVB, DVR-MS, ASF, MP4, TS, MPG) — zero content regressions.

Issue found: filename regression on single-DVB-stream files

We ran all 25 CI test cases locally on both master and this PR. On 3 tests, the PR changes the output filename by adding a language suffix (_eng) even when there's only a single DVB stream:

Test                                         | Master filename        | PR filename                    | Content
1020459a86 --autoprogram --out=ttxt          | output.out             | output_eng.txt                 | Byte-identical
85271be4d2 --autoprogram --out=srt --quant 0 | output.out             | output_eng.srt                 | Byte-identical
85271be4d2 --codec dvbsub --out=spupng       | output.out + output.d/ | output_eng.xml + output_eng.d/ | Byte-identical (all 28 PNGs)

The content is correct — only the filename changes. But this breaks backward compatibility for existing users/scripts that expect the original filename.

Fix: Only add the language suffix when the program has 2 or more DVB subtitle PIDs. Single-DVB-stream recordings should keep the original filename.
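
One possible shape for that check, as a sketch only: count the DVB subtitle PIDs in the program and suffix only when there are at least two. The toy_stream type, the enum and both function names are hypothetical and do not correspond to existing CCExtractor structures.

```c
#include <assert.h>
#include <stddef.h>

enum toy_codec { TOY_TELETEXT, TOY_DVB_SUB };

/* Hypothetical per-stream record; not the real cap_info struct. */
struct toy_stream {
    int pid;
    int program_number;
    enum toy_codec codec;
};

/* Number of DVB subtitle PIDs that belong to program_number. */
int count_dvb_pids(const struct toy_stream *streams, size_t n,
                   int program_number)
{
    assert(streams != NULL || n == 0);
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (streams[i].program_number == program_number &&
            streams[i].codec == TOY_DVB_SUB)
            count++;
    return count;
}

/* 1 if output names should carry a language suffix, 0 otherwise. */
int wants_lang_suffix(const struct toy_stream *streams, size_t n,
                      int program_number)
{
    return count_dvb_pids(streams, n, program_number) >= 2;
}
```

With this rule a recording that has a single DVB subtitle PID keeps its original output name, while the arte_multiaudio.ts layout (PIDs 260 and 262) still gets suffixed names.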

Also needed

  • Add a CHANGES.TXT entry — this is a user-facing feature.

Everything else looks good. Once the filename issue is fixed, this is ready to merge.

Development

Successfully merging this pull request may close these issues.

multi lang, each cc into a new file

3 participants