A simple, reproducible Jupyter workflow to compute overall GC% and sliding-window GC profiles from any DNA FASTA file. GC% analysis is a foundational genomics quality-control step: it helps spot compositional biases, sequencing artefacts, and GC-rich/poor regions that often map to functional elements.
- Notebooks: toy demo + real FASTA analyses
- Outputs: publication-ready figures (PNG) and tables (CSV)
- Inputs: any .fasta / .fa / .fna sequence
This project shows, end-to-end, how to:
- Compute overall GC% for a sequence
- Generate a sliding-window GC% profile across the genome
- Save results as CSV tables and figures for reports and QC
It starts with a toy sequence (for clarity), then extends the pipeline to real genomes:
- Human mitochondrial DNA (rCRS, NC_012920.1)
- Escherichia coli K-12 MG1655 (NC_000913.3)
The notebooks are beginner-friendly and designed to be adapted to any FASTA input.
Environment: Python (Jupyter), standard library + Matplotlib.
- FASTA parsing
  - Lightweight reader concatenates sequence lines, extracts header.
- GC% computation
  - Overall GC% = (G + C) / (A + C + G + T) × 100
  - Sliding windows: configurable `window` and `step` to compute GC% per window.
- Outputs
  - CSV of all windows; top/bottom 10 windows by GC%
  - Figures: line plot (GC% vs position) and single-bar overall GC%
- Real genomes
  - FASTA retrieved from NCBI (RefSeq accessions below).
  - Parameters tuned to genome size (e.g., mtDNA: smaller window; E. coli: larger).
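The parsing and GC% steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebooks' actual code: the function names `read_fasta`, `gc_percent`, and `sliding_gc` are hypothetical, and the sketch assumes a single-record FASTA, 0-based end-exclusive window coordinates, and that a trailing partial window is dropped. Ambiguous bases such as N are excluded from the denominator, in line with the formula above.

```python
def read_fasta(path):
    """Return (header, sequence) for the first record in a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # assumption: only the first record is read
                header = line[1:]
            else:
                chunks.append(line.upper())
    return header, "".join(chunks)


def gc_percent(seq):
    """GC% = (G + C) / (A + C + G + T) * 100; returns 0.0 if no A/C/G/T."""
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def sliding_gc(seq, window, step):
    """Return (start, end, gc%) tuples; 0-based, end-exclusive, full windows only."""
    return [
        (start, start + window, gc_percent(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy usage on an in-memory sequence (no file needed).
toy = "AAGGCCTT"
print(gc_percent(toy))
for row in sliding_gc(toy, window=4, step=2):
    print(row)
```

For a real genome, the same calls would take the sequence returned by `read_fasta(fasta_path)`, with `window` and `step` scaled to the genome length.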
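Results for human mitochondrial DNA (rCRS):
- GC% profile: [figure: line plot of GC% vs position]
- Overall GC: [figure: single-bar overall GC%]
- Tables:
  - All windows: `results/mtDNA/gc_windows.csv`
  - Extremes (top/bottom 10): `results/mtDNA/gc_extremes.csv`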
Results for E. coli K-12 MG1655:
- GC% profile: [figure: line plot of GC% vs position]
- Overall GC: [figure: single-bar overall GC%]
- Tables:
  - All windows: `results/ecoli/gc_windows.csv`
  - Extremes (top/bottom 10): `results/ecoli/gc_extremes.csv`
- Open the notebooks in Jupyter (Anaconda → Jupyter).
- In the “analyse_real_FASTA” notebook, set `fasta_path` to your `.fasta`/`.fa`/`.fna`.
- Choose `window`/`step` (smaller for short genomes; larger for long chromosomes).
- Run all cells. Outputs (PNGs/CSVs) will be written under `results/`.
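As a rough illustration of the table outputs, the sketch below writes an all-windows CSV and a top/bottom extremes CSV using only the standard library. It is an assumption-laden example rather than the notebooks' own code: the function name `write_window_tables`, the column names, and the `group` label column are hypothetical, and the demo writes into a temporary directory instead of `results/`.

```python
import csv
import tempfile
from pathlib import Path


def write_window_tables(rows, out_dir, n_extremes=10):
    """Write gc_windows.csv and gc_extremes.csv; return both paths.

    rows: iterable of (start, end, gc_percent) tuples.
    """
    rows = list(rows)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    columns = ["start", "end", "gc_percent"]  # hypothetical column names

    windows_csv = out_dir / "gc_windows.csv"
    with open(windows_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)

    # Rank windows from highest to lowest GC%.
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    top = ranked[:n_extremes]
    # Start the bottom slice after the top slice so no window is listed twice.
    bottom = ranked[max(n_extremes, len(ranked) - n_extremes):]

    extremes_csv = out_dir / "gc_extremes.csv"
    with open(extremes_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["group"] + columns)
        writer.writerows(("top",) + tuple(row) for row in top)
        writer.writerows(("bottom",) + tuple(row) for row in bottom)

    return windows_csv, extremes_csv


# Demo with made-up window rows, written to a throwaway directory.
demo_rows = [(0, 4, 50.0), (4, 8, 100.0), (8, 12, 25.0)]
with tempfile.TemporaryDirectory() as tmp:
    windows_path, extremes_path = write_window_tables(
        demo_rows, Path(tmp) / "results" / "demo", n_extremes=1
    )
    print(windows_path.read_text())
    print(extremes_path.read_text())
```

Passing an output directory such as `results/mtDNA` would produce file names matching the tables listed earlier.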
Tip: If the single bar looks too “thick,” set `xlim` (e.g., `plt.xlim(-1, 1)`) so the width is visible.
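A minimal sketch of that tip with Matplotlib's pyplot interface is shown here. The `overall_gc` value is a placeholder, not a measured result, and the bar width, figure size, and tick label are illustrative assumptions. The `Agg` backend line is only there so the snippet runs without a display; it can be dropped inside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; not needed in a notebook
import matplotlib.pyplot as plt

overall_gc = 50.0  # placeholder value for illustration only

plt.figure(figsize=(3, 4))
plt.bar([0], [overall_gc], width=0.4)  # one bar centred at x = 0
plt.xlim(-1, 1)  # widen the x-range so the bar no longer fills the axes
plt.xticks([0], ["sequence"])
plt.ylim(0, 100)
plt.ylabel("GC%")
plt.title("Overall GC%")
```

Without the explicit `xlim`, Matplotlib autoscales the x-axis tightly around the single bar, which is why it can appear to fill the whole plot.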
- Add multi-FASTA support to batch-run many sequences.
- Overlay GC% with annotation tracks (genes, CDS, CpG islands) for interpretation.
- Package the notebook logic as a small Python module/CLI for easier reuse.
- Human mitochondrion (rCRS): NCBI RefSeq NC_012920.1
  FASTA viewer: https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=NC_012920.1&db=nuccore&report=fasta
- Escherichia coli K-12 substr. MG1655: NCBI RefSeq NC_000913.3
  FASTA viewer: https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=NC_000913.3&db=nuccore&report=fasta
This repository is open-sourced under the MIT License. See LICENSE.

