Fix switch/flip error rate calculations#131
Fix switch/flip error rate calculations#131JosephLalli wants to merge 1 commit intoodelaneau:mainfrom
Conversation
The previous code counted a string of errors "0 1 1 1 0" as two flips, two switch errors, and one correct variant. For researchers who want to examine the relative rate of flip events and switch errors unrelated to flips, this is unhelpful. Instead, I would argue this represents one flip event, three switch errors, two correct variants, and zero "pure switch errors" (ie isolated switch errors that are not associated with a flip error). This code change calculates flip events, switch errors, pure switch errors, and correct variants. It also calculated how many flip errors occur consecutively (ie, more than two errors in sequence).
There was a problem hiding this comment.
Pull request overview
Updates haplotype_checker’s per-sample reporting to distinguish flip events from switch errors (including “pure” switch errors) to better support downstream analysis of phasing error patterns.
Changes:
- Reworks
writeFlipSwitchErrorPerSample()to compute flip events, total switch errors, pure switch errors, and correct variants. - Adds tracking for “consecutive flip” runs and prints new overall summary rates.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int total = n_all_switches + n_flips + n_correct + n_consecutive_flips; | ||
| int n_pure_switches = n_all_switches - (n_flips*2 + n_consecutive_flips); | ||
| whole_study_errors += n_all_switches; | ||
| whole_study_true_errors += n_pure_switches; | ||
| whole_study_checked += total; | ||
|
|
||
| fdo << H.vecSamples[H.IDXesti[i]] << " " << n_all_switches << " " << n_flips << " " << n_consecutive_flips << " " << n_pure_switches << " " << n_correct << " " << total << " " << stb.str(n_all_switches * 100.0f / total, 2) << " " << stb.str((n_flips*2 + n_consecutive_flips) * 100.0f / total, 2) << " " << stb.str(n_pure_switches * 100.0f / total, 4) << " " << stb.str(n_correct * 100.0f / total, 2) << endl; | ||
| } | ||
| fdo.close(); | ||
| vrb.bullet("#Overall switch error rate = " + stb.str(whole_study_errors * 100.0f / whole_study_checked, 5)); | ||
| vrb.bullet("#Overall pure switch error rate = " + stb.str(whole_study_true_errors * 100.0f / whole_study_checked, 5)); |
There was a problem hiding this comment.
There are multiple divisions by total / whole_study_checked that can be zero when a sample has no Checked variants (so HET is empty) or when no samples contribute any checked variants. This will produce a divide-by-zero at output time. Please add guards (e.g., emit zeros/NA for that sample and skip adding to whole-study totals; and only print overall rates if whole_study_checked > 0).
| int total = n_all_switches + n_flips + n_correct + n_consecutive_flips; | ||
| int n_pure_switches = n_all_switches - (n_flips*2 + n_consecutive_flips); | ||
| whole_study_errors += n_all_switches; | ||
| whole_study_true_errors += n_pure_switches; | ||
| whole_study_checked += total; | ||
|
|
||
| fdo << H.vecSamples[H.IDXesti[i]] << " " << n_all_switches << " " << n_flips << " " << n_consecutive_flips << " " << n_pure_switches << " " << n_correct << " " << total << " " << stb.str(n_all_switches * 100.0f / total, 2) << " " << stb.str((n_flips*2 + n_consecutive_flips) * 100.0f / total, 2) << " " << stb.str(n_pure_switches * 100.0f / total, 4) << " " << stb.str(n_correct * 100.0f / total, 2) << endl; |
There was a problem hiding this comment.
total is computed as n_all_switches + n_flips + n_correct + n_consecutive_flips, which mixes per-variant switch outcomes (n_all_switches, n_correct) with event-based counts (n_flips, n_consecutive_flips). This inflates the denominator and makes the reported per-sample/overall switch error rates inconsistent with check()/writePerSample() (which use sum(Checked) as the denominator). Consider defining total_checked as HET.size() (or n_all_switches + n_correct) and using that for percentages and whole_study_checked, while keeping flip-event counts as separate columns.
| int total = n_all_switches + n_flips + n_correct + n_consecutive_flips; | |
| int n_pure_switches = n_all_switches - (n_flips*2 + n_consecutive_flips); | |
| whole_study_errors += n_all_switches; | |
| whole_study_true_errors += n_pure_switches; | |
| whole_study_checked += total; | |
| fdo << H.vecSamples[H.IDXesti[i]] << " " << n_all_switches << " " << n_flips << " " << n_consecutive_flips << " " << n_pure_switches << " " << n_correct << " " << total << " " << stb.str(n_all_switches * 100.0f / total, 2) << " " << stb.str((n_flips*2 + n_consecutive_flips) * 100.0f / total, 2) << " " << stb.str(n_pure_switches * 100.0f / total, 4) << " " << stb.str(n_correct * 100.0f / total, 2) << endl; | |
| // Number of checked heterozygous variants for this sample | |
| int total_checked = HET.size(); | |
| int n_pure_switches = n_all_switches - (n_flips*2 + n_consecutive_flips); | |
| whole_study_errors += n_all_switches; | |
| whole_study_true_errors += n_pure_switches; | |
| whole_study_checked += total_checked; | |
| fdo << H.vecSamples[H.IDXesti[i]] << " " << n_all_switches << " " << n_flips << " " << n_consecutive_flips << " " << n_pure_switches << " " << n_correct << " " << total_checked << " " << stb.str(n_all_switches * 100.0f / total_checked, 2) << " " << stb.str((n_flips*2 + n_consecutive_flips) * 100.0f / total_checked, 2) << " " << stb.str(n_pure_switches * 100.0f / total_checked, 4) << " " << stb.str(n_correct * 100.0f / total_checked, 2) << endl; |
The previous code counted a string of errors "0 1 1 1 0" as two flips, two switch errors, and one correct variant. For researchers who want to examine the relative rate of flip events and switch errors unrelated to flips, this is unhelpful.
Instead, I would argue this represents one flip event, three switch errors, two correct variants, and zero "pure switch errors" (ie isolated switch errors that are not associated with a flip error).
This code change calculates flip events, switch errors, pure switch errors, and correct variants. It also calculated how many flip errors occur consecutively (ie, more than two errors in sequence).
Re-done post rebase.