Genomic Motif Discovery Assignment

Course: COBI 8301 Computational Genomics
Time Limit: 1.5 hours
Total Points: 100 points

Overview

You will implement a bioinformatics pipeline that discovers transcription factor binding sites (motifs) in DNA sequences. This assignment integrates suffix arrays, Bayesian inference, and EM algorithms in a realistic computational biology application.

Learning Objectives

By completing this assignment, you will:

Apply suffix arrays to efficiently search biological sequences
Implement the EM algorithm for probabilistic model learning
Use Bayes theorem for pattern scoring and validation
Connect string algorithms with machine learning methods

Dataset Description

You are provided with 100 DNA sequences (600-1000 bp each) from ChIP-seq peak regions. These sequences contain planted transcription factor binding motifs for SP1: GC-rich binding site (consensus: GGGCGGGGC).

Files Provided

motif_data/
├── chip_seq_peaks.fasta          # Input sequences
├── background_model.txt          # Genome background frequencies
├── ground_truth.txt              # Planted motif locations (for validation)
├── motif_definitions.txt         # Known PWM definitions
└── dataset_stats.txt             # Dataset summary

Starter code:

motif_discovery.py - Main template with function stubs
dataset_generator.py - Script that generated the data

Assignment Tasks

Part 1: Suffix Array Construction (30 points)

Implement efficient k-mer discovery using suffix arrays:

Function: build_suffix_array(sequences)

Concatenate sequences with unique separators ($1, $2, etc.)
Build suffix array using sorting
Return concatenated text, suffix array, and sequence boundaries

Function: find_frequent_kmers(suffix_array, text, k, min_count)

Use suffix array to find all k-mers occurring ≥ min_count times
Skip k-mers containing separator characters
Return list of (kmer, count, positions)

Function: get_motif_candidate(sequences, k=8, top_n=5)

Score k-mers by frequency and distribution across sequences
Return top candidate for EM refinement

Part 2: EM Algorithm Implementation (30 points)

Implement Position Weight Matrix learning using EM:

Function: initialize_pwm(motif_sites)

Create PWM from aligned sequences with pseudocounts
Return PWM[position][nucleotide] = probability

Function: calculate_motif_probability(sequence, pwm)

Calculate P(sequence|motif) using PWM

Function: em_motif_refinement(sequences, initial_sites, max_iterations=20)

Track likelihood improvement and check convergence
Return final PWM, sites, and likelihood history

Part 3: Bayesian Scoring (15 points)

Implement motif validation using Bayesian methods:

Function: validate_motif(sequences, pwm, shuffle_trials=100)

Compare real motif score vs. shuffled sequence scores
Return p-value (fraction of shuffled scores ≥ real score)
P-value < 0.05 indicates significant motif

Part 4: Analysis Report (25 points)

Write a brief analysis addressing

Algorithm Connection: How did suffix arrays accelerate motif discovery compared to naive enumeration?
EM Convergence: Describe the EM algorithm's convergence behavior. How much did the likelihood improve?
Biological Interpretation: What motif did you discover? How does it compare to the ground truth?
Pipeline: In this example, we knew that we were searching for a 9-mer, which simplified our analysis. How would you work on an algorithm where you did not know the length of the motif?

Implementation Tips

Suffix Array Construction

# Efficient suffix array building
def build_suffix_array(sequences):
    # Concatenate with separators
    text = ""
    boundaries = []
    for i, seq_data in enumerate(sequences):
        boundaries.append(len(text))
        text += seq_data['sequence'] + f"${i+1}"
    
    # Build suffix array
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, boundaries

EM Algorithm Structure

def em_motif_refinement(sequences, initial_sites, max_iterations=20):
    current_pwm = initialize_pwm(initial_sites)
    
    for iteration in range(max_iterations):
        # E-step: Find best positions
        new_sites = []
        for seq_data in sequences:
            score, pos = find_best_motif_position(seq_data['sequence'], current_pwm)
            new_sites.append(seq_data['sequence'][pos:pos+motif_length])
        
        # M-step: Update PWM
        new_pwm = initialize_pwm(new_sites)
        
        # Check convergence
        if converged(current_pwm, new_pwm):
            break
        current_pwm = new_pwm
    
    return current_pwm, new_sites, likelihood_history

Submission Requirements

Submit the following files:

motif_discovery.py - Your completed implementation
analysis_report.txt - Your written analysis

Academic Integrity

You may discuss algorithms and approaches with classmates
Implementation must be your own work
Cite any external resources used
Do not share code or look at others' implementations

Good luck with your motif discovery adventure!

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
motif_data		motif_data
README.md		README.md
dataset_generator.py		dataset_generator.py
motif_discovery.py		motif_discovery.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Genomic Motif Discovery Assignment

Overview

Learning Objectives

Dataset Description

Files Provided

Assignment Tasks

Part 1: Suffix Array Construction (30 points)

Part 2: EM Algorithm Implementation (30 points)

Part 3: Bayesian Scoring (15 points)

Part 4: Analysis Report (25 points)

Implementation Tips

Suffix Array Construction

EM Algorithm Structure

Submission Requirements

Academic Integrity

About

Uh oh!

Releases

Packages

Languages

uvacobi/tfbs_lab

Folders and files

Latest commit

History

Repository files navigation

Genomic Motif Discovery Assignment

Overview

Learning Objectives

Dataset Description

Files Provided

Assignment Tasks

Part 1: Suffix Array Construction (30 points)

Part 2: EM Algorithm Implementation (30 points)

Part 3: Bayesian Scoring (15 points)

Part 4: Analysis Report (25 points)

Implementation Tips

Suffix Array Construction

EM Algorithm Structure

Submission Requirements

Academic Integrity

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages