Course: COBI 8301 Computational Genomics
Time Limit: 1.5 hours
Total Points: 100 points
You will implement a bioinformatics pipeline that discovers transcription factor binding sites (motifs) in DNA sequences. This assignment integrates suffix arrays, Bayesian inference, and EM algorithms in a realistic computational biology application.
By completing this assignment, you will:
- Apply suffix arrays to efficiently search biological sequences
- Implement the EM algorithm for probabilistic model learning
- Use Bayes theorem for pattern scoring and validation
- Connect string algorithms with machine learning methods
You are provided with 100 DNA sequences (600-1000 bp each) from ChIP-seq peak regions. These sequences contain planted transcription factor binding motifs for SP1: GC-rich binding site (consensus: GGGCGGGGC).
motif_data/
├── chip_seq_peaks.fasta # Input sequences
├── background_model.txt # Genome background frequencies
├── ground_truth.txt # Planted motif locations (for validation)
├── motif_definitions.txt # Known PWM definitions
└── dataset_stats.txt # Dataset summary
Starter code:
motif_discovery.py- Main template with function stubsdataset_generator.py- Script that generated the data
Implement efficient k-mer discovery using suffix arrays:
Function: build_suffix_array(sequences)
- Concatenate sequences with unique separators (
$1,$2, etc.) - Build suffix array using sorting
- Return concatenated text, suffix array, and sequence boundaries
Function: find_frequent_kmers(suffix_array, text, k, min_count)
- Use suffix array to find all k-mers occurring ≥ min_count times
- Skip k-mers containing separator characters
- Return list of (kmer, count, positions)
Function: get_motif_candidate(sequences, k=8, top_n=5)
- Score k-mers by frequency and distribution across sequences
- Return top candidate for EM refinement
Implement Position Weight Matrix learning using EM:
Function: initialize_pwm(motif_sites)
- Create PWM from aligned sequences with pseudocounts
- Return PWM[position][nucleotide] = probability
Function: calculate_motif_probability(sequence, pwm)
- Calculate P(sequence|motif) using PWM
Function: em_motif_refinement(sequences, initial_sites, max_iterations=20)
- Track likelihood improvement and check convergence
- Return final PWM, sites, and likelihood history
Implement motif validation using Bayesian methods:
Function: validate_motif(sequences, pwm, shuffle_trials=100)
- Compare real motif score vs. shuffled sequence scores
- Return p-value (fraction of shuffled scores ≥ real score)
- P-value < 0.05 indicates significant motif
Write a brief analysis addressing
-
Algorithm Connection: How did suffix arrays accelerate motif discovery compared to naive enumeration?
-
EM Convergence: Describe the EM algorithm's convergence behavior. How much did the likelihood improve?
-
Biological Interpretation: What motif did you discover? How does it compare to the ground truth?
-
Pipeline: In this example, we knew that we were searching for a 9-mer, which simplified our analysis. How would you work on an algorithm where you did not know the length of the motif?
# Efficient suffix array building
def build_suffix_array(sequences):
# Concatenate with separators
text = ""
boundaries = []
for i, seq_data in enumerate(sequences):
boundaries.append(len(text))
text += seq_data['sequence'] + f"${i+1}"
# Build suffix array
suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
return text, suffix_array, boundariesdef em_motif_refinement(sequences, initial_sites, max_iterations=20):
current_pwm = initialize_pwm(initial_sites)
for iteration in range(max_iterations):
# E-step: Find best positions
new_sites = []
for seq_data in sequences:
score, pos = find_best_motif_position(seq_data['sequence'], current_pwm)
new_sites.append(seq_data['sequence'][pos:pos+motif_length])
# M-step: Update PWM
new_pwm = initialize_pwm(new_sites)
# Check convergence
if converged(current_pwm, new_pwm):
break
current_pwm = new_pwm
return current_pwm, new_sites, likelihood_historySubmit the following files:
motif_discovery.py- Your completed implementationanalysis_report.txt- Your written analysis
- You may discuss algorithms and approaches with classmates
- Implementation must be your own work
- Cite any external resources used
- Do not share code or look at others' implementations
Good luck with your motif discovery adventure!