assadiab/geneprediction-tp


Gene prediction

Summary

A gene corresponds to a subsequence of a transcript that can be translated into protein by the ribosome. Its reading frame consists of consecutive triplets running from an initiation codon ('AUG', 'UUG', 'CUG', 'AUU' or 'GUG') to a stop codon ('UAA', 'UAG' or 'UGA'); both codons lie in the same reading frame. Upstream of the initiation codon is a motif that allows translation to start by binding the 16S ribosomal RNA: AGGAGGUAA, called the Shine-Dalgarno sequence [Shine and Dalgarno 1973]. This motif is not necessarily in the same reading frame as the initiation codon and may be incomplete.
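To make the reading-frame definition concrete, here is a minimal sketch of an open-reading-frame scan on the forward strand. It is an illustration, not the code in gpred.py: the function name `find_orfs`, the coordinate convention (0-based start, exclusive end, stop codon included) and the `min_len` parameter are assumptions, and the codons are written in the DNA alphabet (T instead of U) because genome FASTA files store DNA.

```python
# Codons from the summary above, written with T instead of U (DNA alphabet).
START_CODONS = {"ATG", "TTG", "CTG", "ATT", "GTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_len=0):
    """Return (start, end) pairs for forward-strand reading frames.

    Hypothetical helper: start is 0-based, end is exclusive and
    includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon within this frame.
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon in START_CODONS:
                start = pos  # first initiation codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if end - start >= min_len:
                    orfs.append((start, end))
                start = None  # look for the next initiation codon
    return orfs


print(find_orfs("CCATGAAATAGCC"))  # [(2, 11)], i.e. ATGAAATAG
```

Because the start and stop codons are only compared at positions sharing the same offset modulo 3, they are guaranteed to be in the same reading frame.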

Few organisms currently benefit from an experimentally verified annotation. Gene prediction therefore remains an important task for the automatic annotation of genomes. Several software tools and approaches exist for this task.

In this project, we implemented a simple approach to predict prokaryotic genes based on the detection of reading frames and the Shine-Dalgarno motif. The objective is to predict the genes of the reference genome of Listeria monocytogenes EGD-e (sequenced and assembled by the Institut Pasteur), which contains 2867 genes.
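The Shine-Dalgarno check can be sketched as a pattern search in a short window upstream of a candidate initiation codon. This is only an illustration under stated assumptions: the pattern `A?G?GAGG` is a hypothetical way to tolerate an incomplete motif, the function name `has_shine_dalgarno` is invented for this example, and the window is assumed to be the `max_distance` bases immediately before the codon. The actual rule used by gpred.py may differ.

```python
import re

# Hypothetical pattern: the GAGG core of AGGAGG, with optional leading
# bases so that an incomplete motif still matches (DNA alphabet).
SD_PATTERN = re.compile(r"A?G?GAGG")


def has_shine_dalgarno(sequence, start, max_distance=16):
    """Return True if the pattern occurs within max_distance bases
    upstream of the initiation codon located at index `start`."""
    window_start = max(0, start - max_distance)
    upstream = sequence[window_start:start]
    return SD_PATTERN.search(upstream) is not None


genome = "TTAGGAGGTTTTTATGAAATAA"
start = genome.find("ATG")  # 13
print(has_shine_dalgarno(genome, start))                  # True
print(has_shine_dalgarno(genome, start, max_distance=4))  # False
```

The search ignores reading frames on purpose, since the motif does not have to be in frame with the initiation codon.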

Clone the Repository

To download the project to your local machine:

git clone https://github.com/assadiab/geneprediction-tp.git
cd geneprediction-tp

Dependency Installation

This project uses Pixi as an environment and dependency manager. All dependencies are listed in the pixi.toml file.

To install or update the environment:

pixi install   # Installs dependencies if not already installed
pixi update    # Updates the environment according to pixi.toml

Basic usage

After cloning the repository and navigating to the project directory, update the Pixi environment (see Dependency Installation), then list the files to verify that everything is present:

ls

You should see:

├── README.md
├── data/
├── gpred/
├── pixi.lock
├── pixi.toml
├── results/
└── tests/

The main script is gpred.py. It runs from the terminal with the following arguments:

pixi run python gpred/gpred.py -i data/listeria.fna -p results2/predict_genes.csv -o results2/genes.fna

โš ๏ธ The output folder (results2/) will be automatically created if it doesn't exist.

Results from gpred can be compared with:

Comparison between gpred and Prodigal

Available arguments:

  -h, --help            show this help message and exit
  -i GENOME_FILE        Complete genome file in fasta format
  -g MIN_GENE_LEN       Minimum gene length to consider (default 50).
  -s MAX_SHINE_DALGARNO_DISTANCE
                        Maximum distance from start codon where to look for a Shine-Dalgarno motif (default 16)
  -d MIN_GAP            Minimum gap between two genes (shine box not included, default 40).
  -p PREDICTED_GENES_FILE
                        Tabular file giving position of predicted genes
  -o FASTA_FILE         Fasta file giving sequence of predicted genes
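For readers who want to see how such an interface maps to code, here is a minimal argparse sketch that mirrors the flags and defaults listed above. It is a reconstruction from the help text only: the function name `build_parser`, the integer types and the absence of required flags are assumptions, and the real parser in gpred.py may be defined differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="gpred.py")
    parser.add_argument("-i", dest="genome_file",
                        help="Complete genome file in fasta format")
    parser.add_argument("-g", dest="min_gene_len", type=int, default=50,
                        help="Minimum gene length to consider (default 50).")
    parser.add_argument("-s", dest="max_shine_dalgarno_distance", type=int,
                        default=16,
                        help="Maximum distance from start codon where to "
                             "look for a Shine-Dalgarno motif (default 16)")
    parser.add_argument("-d", dest="min_gap", type=int, default=40,
                        help="Minimum gap between two genes "
                             "(shine box not included, default 40).")
    parser.add_argument("-p", dest="predicted_genes_file",
                        help="Tabular file giving position of predicted genes")
    parser.add_argument("-o", dest="fasta_file",
                        help="Fasta file giving sequence of predicted genes")
    return parser


args = build_parser().parse_args(["-i", "data/listeria.fna", "-g", "100"])
print(args.min_gene_len, args.max_shine_dalgarno_distance, args.min_gap)
# 100 16 40
```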

Contact

If you have any questions, you can contact me by email: assa.diabira@etu.u-paris.fr
