Understanding how methods work
The synteny methods are based on scripts developed by Fabien Degalez (Cross-species orthology detection of long non-coding RNAs (lncRNA) through 13 species using genomic and functional annotations. Degalez et al., 2024. bioRxiv).
The figure below is extracted from the paper and represents each synteny method:
- method 1 uses 2 orthologous Protein Coding Genes (PCGs) as anchors surrounding the lncRNA
- method 2 uses the closest orthologous PCG as a single anchor and requires the same lncRNA classification according to the FEELnc classification (FEELnc_classifier.pl)
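The idea behind method 1 can be illustrated with a minimal Python sketch. This is not the code from the Degalez et al. scripts: it assumes a single chromosome per species, genes already sorted by position, and a one-to-one ortholog table. The function names `flanking_pcgs` and `syntenic_pairs` and the toy gene IDs are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Gene = Tuple[str, str]  # (gene_id, biotype), biotype is "PCG" or "lncRNA"


def flanking_pcgs(genes: List[Gene]) -> Dict[str, Tuple[str, str]]:
    """Map each lncRNA to its nearest PCG on each side, in gene order."""
    flanks = {}
    for i, (gene_id, biotype) in enumerate(genes):
        if biotype != "lncRNA":
            continue
        left: Optional[str] = next(
            (g for g, b in reversed(genes[:i]) if b == "PCG"), None)
        right: Optional[str] = next(
            (g for g, b in genes[i + 1:] if b == "PCG"), None)
        if left is not None and right is not None:
            flanks[gene_id] = (left, right)
    return flanks


def syntenic_pairs(sp1: List[Gene], sp2: List[Gene],
                   orthologs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair lncRNAs whose two flanking PCGs are orthologous (assumption:
    anchor order may be inverted between species)."""
    flanks2 = flanking_pcgs(sp2)
    pairs = []
    for lnc1, (left1, right1) in flanking_pcgs(sp1).items():
        anchors = {orthologs.get(left1), orthologs.get(right1)}
        if None in anchors:
            continue  # at least one anchor has no known ortholog
        for lnc2, (left2, right2) in flanks2.items():
            if {left2, right2} == anchors:
                pairs.append((lnc1, lnc2))
    return pairs


# Toy data: lnc1 sits between A and B, lnc2 between their orthologs b and a.
species1 = [("A", "PCG"), ("lnc1", "lncRNA"), ("B", "PCG")]
species2 = [("b", "PCG"), ("lnc2", "lncRNA"), ("a", "PCG"),
            ("lnc3", "lncRNA"), ("c", "PCG")]
pairs = syntenic_pairs(species1, species2, {"A": "a", "B": "b"})
```

In this toy example only lnc1 and lnc2 are paired, because lnc3 is flanked by a PCG (c) that is not an ortholog of either anchor.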
The FEELnc classifier is first run on isoforms. The script FEELnc_tpLevel2gnLevelClassification.R, developed by Degalez et al., then derives the lncRNA classification at the gene level, as explained here:
Legend
This classification is reported in the detailed output: scans_results/method2/syntenyByPairFeelnc/shortNameSP1-shortNameSP2_lncConfigurationHomologyAggregated.tsv
- lncg = long genic non-coding gene, with an lnc/PCG distance below the intergenic/genic threshold
- linc = long intergenic non-coding gene, with an lnc/PCG distance above the intergenic/genic threshold
- SS.up/.dw = lnc and PCG are on the same strand, lnc upstream/downstream
- Conv = lnc and PCG are on different strands and convergent
- Divg = lnc and PCG are on different strands and divergent
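To make the legend concrete, here is a rough Python sketch of how one lncRNA/PCG pair could be assigned to these categories. It is an illustration only, not the logic of FEELnc or of FEELnc_tpLevel2gnLevelClassification.R. The function name `classify_configuration` is hypothetical, the threshold is a free parameter (its actual value is not stated here), and "up/down" is assumed to be relative to the PCG's transcription direction.

```python
from typing import Tuple


def classify_configuration(lnc_start: int, lnc_end: int, lnc_strand: str,
                           pcg_start: int, pcg_end: int, pcg_strand: str,
                           threshold: int) -> Tuple[str, str]:
    """Return (class, orientation) for one lncRNA/PCG pair.

    Coordinates are on the same chromosome with start < end.
    """
    # Distance between the closest edges; 0 when the two genes overlap.
    distance = max(0, max(lnc_start, pcg_start) - min(lnc_end, pcg_end))
    # Assumption: a distance equal to the threshold counts as intergenic.
    gene_class = "lncg" if distance < threshold else "linc"

    # Simplification: relative position is decided by start coordinates.
    lnc_is_left = lnc_start < pcg_start

    if lnc_strand == pcg_strand:
        # Upstream means "before the PCG" in its direction of transcription.
        upstream = lnc_is_left if pcg_strand == "+" else not lnc_is_left
        orientation = "SS.up" if upstream else "SS.dw"
    else:
        # Left gene on "+" and right gene on "-" point toward each other.
        left_strand = lnc_strand if lnc_is_left else pcg_strand
        orientation = "Conv" if left_strand == "+" else "Divg"
    return gene_class, orientation


example = classify_configuration(100, 200, "+", 1000, 2000, "-", threshold=500)
```

Here `example` describes an intergenic lncRNA on the opposite strand that is transcribed toward the PCG.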
Method 3, based on sequence alignment, produces several output files per analyzed species pair:
- liftoff output files: liftoff_species1_to_species2_flankX.gtf, with and without filtering
- alignment_analysis output files: mapped_knownGenes.txt, mapped_unknownGenes.txt and unmapped_genes.txt (see Figure below to explain these files)
- liftoff output figure to visualize sequence alignments in terms of coverage relative to sequence identity
Several options are available:
- biotype to analyze lncRNA only, mRNA only or both
- flank to determine the amount of flanking sequence to align, as a fraction of gene length. This can improve gene alignment where gene structure differs between target and reference (liftoff option)
- coverage to set the cut-off on coverage
- identity to set the cut-off on sequence identity
Coverage and identity are liftoff options. The liftoff output file liftoff_species1_to_species2_flank0.gtf contains all data without any cut-off, while the filtered.gtf file contains only the results that pass the chosen coverage and identity cut-offs, with extra copies removed.
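The filtering step can be pictured with the following Python sketch. It is not the pipeline's own code: it assumes that gene records in the liftoff GTF carry the attributes `coverage`, `sequence_ID` and `extra_copy_number` in the usual GTF `key "value";` style, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')  # matches: key "value"


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the 9th GTF column into a dict."""
    return dict(ATTR_RE.findall(field))


def passing_gene_ids(lines: Iterable[str], min_coverage: float,
                     min_identity: float) -> Set[str]:
    """Collect gene IDs that pass both cut-offs and are not extra copies."""
    kept = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        if "coverage" not in attrs or "sequence_ID" not in attrs:
            continue  # assumption: unannotated records are not kept
        if (float(attrs["coverage"]) >= min_coverage
                and float(attrs["sequence_ID"]) >= min_identity
                and attrs.get("extra_copy_number", "0") == "0"):
            kept.add(attrs["gene_id"])
    return kept


def filter_gtf(lines: List[str], min_coverage: float,
               min_identity: float) -> List[str]:
    """Keep every record (gene, transcript, exon) of the passing genes."""
    kept = passing_gene_ids(lines, min_coverage, min_identity)
    result = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and parse_attributes(cols[8]).get("gene_id") in kept:
            result.append(line)
    return result


def _rec(feature: str, attrs: str) -> str:
    # Helper to build toy GTF lines for the example below.
    return "\t".join(["chr1", "Liftoff", feature, "1", "100", ".", "+", ".", attrs])


records = [
    _rec("gene", 'gene_id "g1"; coverage "0.95"; sequence_ID "0.90"; extra_copy_number "0";'),
    _rec("exon", 'gene_id "g1"; transcript_id "t1";'),
    _rec("gene", 'gene_id "g2"; coverage "0.40"; sequence_ID "0.99"; extra_copy_number "0";'),
    _rec("gene", 'gene_id "g3"; coverage "1.0"; sequence_ID "1.0"; extra_copy_number "1";'),
]
filtered = filter_gtf(records, min_coverage=0.5, min_identity=0.5)
```

With these toy records, only the two lines belonging to g1 remain: g2 fails the coverage cut-off and g3 is an extra copy.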
Bedtools intersect is used with its default minimum overlap of about 1 bp. This value cannot be customized for now, but the overlap fraction is reported in the mapped_knownGenes.txt files (see section)
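As a reminder of what an overlap fraction means, the short Python sketch below computes it for two intervals. It assumes BED-style half-open coordinates and expresses the fraction relative to the first interval; how the column in mapped_knownGenes.txt is actually computed is not specified here, and the function names are hypothetical.

```python
def overlap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def overlap_fraction(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Shared bases as a fraction of the length of interval A."""
    length_a = a_end - a_start
    if length_a <= 0:
        raise ValueError("interval A must have a positive length")
    return overlap_bp(a_start, a_end, b_start, b_end) / length_a


# A 100 bp feature sharing its last 50 bp with another feature.
shared = overlap_bp(100, 200, 150, 300)
fraction = overlap_fraction(100, 200, 150, 300)
# With a 1 bp minimum, any pair with shared >= 1 counts as intersecting.
is_hit = shared >= 1
```

A low fraction therefore flags hits that pass the 1 bp minimum but cover only a small part of the feature.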