Skip to content

uvacobi/tfbs_lab

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Genomic Motif Discovery Assignment

Course: COBI 8301 Computational Genomics
Time Limit: 1.5 hours
Total Points: 100 points

Overview

You will implement a bioinformatics pipeline that discovers transcription factor binding sites (motifs) in DNA sequences. This assignment integrates suffix arrays, Bayesian inference, and EM algorithms in a realistic computational biology application.

Learning Objectives

By completing this assignment, you will:

  • Apply suffix arrays to efficiently search biological sequences
  • Implement the EM algorithm for probabilistic model learning
  • Use Bayes theorem for pattern scoring and validation
  • Connect string algorithms with machine learning methods

Dataset Description

You are provided with 100 DNA sequences (600-1000 bp each) from ChIP-seq peak regions. These sequences contain planted transcription factor binding motifs for SP1: GC-rich binding site (consensus: GGGCGGGGC).

Files Provided

motif_data/
├── chip_seq_peaks.fasta          # Input sequences
├── background_model.txt          # Genome background frequencies
├── ground_truth.txt              # Planted motif locations (for validation)
├── motif_definitions.txt         # Known PWM definitions
└── dataset_stats.txt             # Dataset summary

Starter code:

  • motif_discovery.py - Main template with function stubs
  • dataset_generator.py - Script that generated the data

Assignment Tasks

Part 1: Suffix Array Construction (30 points)

Implement efficient k-mer discovery using suffix arrays:

Function: build_suffix_array(sequences)

  • Concatenate sequences with unique separators ($1, $2, etc.)
  • Build suffix array using sorting
  • Return concatenated text, suffix array, and sequence boundaries

Function: find_frequent_kmers(suffix_array, text, k, min_count)

  • Use suffix array to find all k-mers occurring ≥ min_count times
  • Skip k-mers containing separator characters
  • Return list of (kmer, count, positions)

Function: get_motif_candidate(sequences, k=8, top_n=5)

  • Score k-mers by frequency and distribution across sequences
  • Return top candidate for EM refinement

Part 2: EM Algorithm Implementation (30 points)

Implement Position Weight Matrix learning using EM:

Function: initialize_pwm(motif_sites)

  • Create PWM from aligned sequences with pseudocounts
  • Return PWM[position][nucleotide] = probability

Function: calculate_motif_probability(sequence, pwm)

  • Calculate P(sequence|motif) using PWM

Function: em_motif_refinement(sequences, initial_sites, max_iterations=20)

  • Track likelihood improvement and check convergence
  • Return final PWM, sites, and likelihood history

Part 3: Bayesian Scoring (15 points)

Implement motif validation using Bayesian methods:

Function: validate_motif(sequences, pwm, shuffle_trials=100)

  • Compare real motif score vs. shuffled sequence scores
  • Return p-value (fraction of shuffled scores ≥ real score)
  • P-value < 0.05 indicates significant motif

Part 4: Analysis Report (25 points)

Write a brief analysis addressing

  1. Algorithm Connection: How did suffix arrays accelerate motif discovery compared to naive enumeration?

  2. EM Convergence: Describe the EM algorithm's convergence behavior. How much did the likelihood improve?

  3. Biological Interpretation: What motif did you discover? How does it compare to the ground truth?

  4. Pipeline: In this example, we knew that we were searching for a 9-mer, which simplified our analysis. How would you work on an algorithm where you did not know the length of the motif?

Implementation Tips

Suffix Array Construction

# Efficient suffix array building
def build_suffix_array(sequences):
    # Concatenate with separators
    text = ""
    boundaries = []
    for i, seq_data in enumerate(sequences):
        boundaries.append(len(text))
        text += seq_data['sequence'] + f"${i+1}"
    
    # Build suffix array
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, boundaries

EM Algorithm Structure

def em_motif_refinement(sequences, initial_sites, max_iterations=20):
    current_pwm = initialize_pwm(initial_sites)
    
    for iteration in range(max_iterations):
        # E-step: Find best positions
        new_sites = []
        for seq_data in sequences:
            score, pos = find_best_motif_position(seq_data['sequence'], current_pwm)
            new_sites.append(seq_data['sequence'][pos:pos+motif_length])
        
        # M-step: Update PWM
        new_pwm = initialize_pwm(new_sites)
        
        # Check convergence
        if converged(current_pwm, new_pwm):
            break
        current_pwm = new_pwm
    
    return current_pwm, new_sites, likelihood_history

Submission Requirements

Submit the following files:

  1. motif_discovery.py - Your completed implementation
  2. analysis_report.txt - Your written analysis

Academic Integrity

  • You may discuss algorithms and approaches with classmates
  • Implementation must be your own work
  • Cite any external resources used
  • Do not share code or look at others' implementations

Good luck with your motif discovery adventure!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages