Working with alleles from highly polymorphic genes, like those in HLA and KIR clusters, is already hard enough! This repo is a collection of tools to facilitate your work on the manipulation and analysis of allele data sets.
The tools in this repo are sorted by category:
- genotype: a set of tools to facilitate the genotyping process.
- convert: a group of commands that convert allele data between file formats, including vcf, csv and our own format .alt (from allele table). You can also convert from vcf to files compatible with pyHLA and PyPop.
- refactor: some commands to normalize allele resolutions and other useful refactoring.
Warning
This project is under development and currently in the alpha stage, so some things might break. If that happens, please do not hesitate to file an issue.
To install this package you can use pip:
pip install git+https://github.com/nmendozam/alleleTools.git
This installs the altools command in your current environment. To execute a
command you need to specify three things: altools [tool_category] [tool_name] [input]. For example:
altools convert vcf2allele input.vcf
To convert a genotype file to vcf, you can use the command allele2vcf. It
will append the genotyped alleles to a vcf file.
altools convert allele2vcf resources/hla_diversity.txt \
--loci_file resources/gene_locations.tsv \
--vcf file_to_append_to.vcf
The input is a tab-separated file in which the first column is the sample name, followed by a pair of columns for each gene. The header convention for each pair is "gene" and "gene.1", e.g.
"id" "sbgroup" "A" "A.1"
"sample1" "CEPH" "03:01" "02:01"
Additionally, the script requires a list of gene locations. The file should be tab-separated with the following format:
gene start
HFE 6:26087441
HLA-A 6:29942554
The first column is the gene name and the second column is (chromosome):(position). This position data can be found in Ensembl or UCSC. The sample file used in this repo was obtained from a post in IPD-IMGT.
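To make the two input layouts concrete, here is a minimal Python sketch that reads both of them. It is an illustration only, not the code used by altools: the function names read_allele_table and parse_gene_locations are hypothetical, and the sketch assumes tab-separated fields, optional double quotes, and that any column X with a matching X.1 column is a gene.

```python
import csv
import io


def read_allele_table(handle):
    """Return {sample: {gene: (allele_1, allele_2)}} from an allele table.

    Hypothetical helper: assumes the first column is the sample name and
    gene columns come in "gene" / "gene.1" pairs.
    """
    reader = csv.reader(handle, delimiter="\t", quotechar='"')
    header = next(reader)
    # A column counts as a gene only if its ".1" partner exists,
    # so extra columns such as "sbgroup" are skipped.
    genes = [name for name in header if name + ".1" in header]
    table = {}
    for row in reader:
        if not row:
            continue
        record = dict(zip(header, row))
        table[row[0]] = {g: (record[g], record[g + ".1"]) for g in genes}
    return table


def parse_gene_locations(lines):
    """Return {gene: (chromosome, position)} from gene-location lines.

    Hypothetical helper: assumes a header line followed by rows of
    gene<TAB>chromosome:position.
    """
    rows = [line.strip() for line in lines if line.strip()]
    locations = {}
    for row in rows[1:]:  # rows[0] is the "gene start" header
        gene, start = row.split("\t")
        chromosome, position = start.split(":")
        locations[gene] = (chromosome, int(position))
    return locations


table_text = '"id"\t"sbgroup"\t"A"\t"A.1"\n"sample1"\t"CEPH"\t"03:01"\t"02:01"\n'
alleles = read_allele_table(io.StringIO(table_text))
print(alleles)  # {'sample1': {'A': ('03:01', '02:01')}}

location_lines = ["gene\tstart", "HFE\t6:26087441", "HLA-A\t6:29942554"]
locations = parse_gene_locations(location_lines)
print(locations["HLA-A"])  # ('6', 29942554)
```

Note that the table header uses "A" while the location list uses "HLA-A"; how altools matches the two naming styles is not covered by this sketch.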
The file passed with --vcf contains the known SNPs and the header. The script
will append the genotyped alleles to this file. The header should contain the
gene names in the same format as the gene location list. You need to ensure
that the file contains ONLY the samples in the genotype file and no more.
Otherwise the concatenated alleles won't match the header. To filter the
samples you can use bcftools:
cut -d' ' -f1 resources/hla_diversity.txt | tail -n +2 | tr -d '"' | uniq > samples_id.txt
bcftools view --force-samples -S samples_id.txt test.vcf > filtered.vcf
Converting from vcf to a genotype table is also useful. For example, when
alleles are imputed, the output is a vcf file. To convert it to a genotype
table you can use the command vcf2alleles. The vcf file should be filtered to
contain only the HLA genes; you can use bcftools to do this. The script will
output a .pyhla file that can be used with pyHLA and PyPop. The phenotype file
is optional and should follow the .phe format of plink files.
bcftools view --include 'ID~"HLA"' raw_imputed.vcf > only_hla.vcf
altools convert vcf2alleles only_hla.vcf --phe input.phe --out output.pyhla
This script normalizes the alleles in the input file to a uniform resolution level, which facilitates association analyses. It ensures that alleles such as 01 and 01:01, which are essentially identical, are recognized as equal by renaming them consistently. The mapping for each resolution level is:
- resolution 1:
  - 01:01 -> 01
  - 01 -> 01
  - 02:03 -> 02
- resolution 2:
  - 01:01 -> 01:01
  - 01 -> 01:01
  - 02:03 -> 02:03
- resolution 3:
  - 01:01 -> 01:01:01
  - 01 -> 01:01:01
  - 02:03 -> 02:03:01
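The mapping above can be expressed in a few lines of Python. This is a sketch of the rule as it appears in the examples (truncate extra fields, pad missing fields with 01), not the shell script itself; the function name normalize_resolution is hypothetical, and special cases such as expression suffixes are not handled.

```python
def normalize_resolution(allele, fields):
    """Truncate or pad a colon-separated allele name to `fields` fields.

    Hypothetical helper: missing fields are padded with "01", as in the
    examples above.
    """
    if fields < 1:
        raise ValueError("fields must be at least 1")
    parts = allele.split(":")[:fields]        # drop fields beyond the target
    parts += ["01"] * (fields - len(parts))   # pad up to the target
    return ":".join(parts)


for name in ("01:01", "01", "02:03"):
    print(name, "->", [normalize_resolution(name, n) for n in (1, 2, 3)])
```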
If an output file name is not provided, it will be named
*.[resolution]fields.tsv, containing the resolved alleles. Up to three
resolution levels are supported (one, two and three).
bash src/alleleTools/refactor/allele_resolution.sh one file1.tsv file2.tsv
This script generates a consensus HLA genotype from the results of several HLA genotyping algorithms. It outputs either a vcf file or an allele table.
altools genotype consensus --input "IKMB_Reports/*.json" \
--output "output.txt" \
--format pyhla
The input files follow the format of the reports generated by the IKMB HLA
genotyping pipeline. These report files should be JSON files
in a folder called IKMB_Reports/.
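One plausible way to combine calls from several algorithms is a majority vote per gene. The sketch below illustrates that idea only. It is an assumption, not necessarily how altools genotype consensus works. It does not read the IKMB JSON reports, and the names consensus_genotype, min_support and algorithm_a/b/c are hypothetical.

```python
from collections import Counter


def consensus_genotype(calls, min_support=2):
    """Return the allele pair reported by the most algorithms for one gene.

    `calls` maps a (hypothetical) algorithm name to its pair of alleles.
    Returns None when no pair reaches `min_support` votes.
    """
    # Sort each pair so that allele order does not split the vote.
    votes = Counter(tuple(sorted(pair)) for pair in calls.values())
    if not votes:
        return None
    genotype, support = votes.most_common(1)[0]
    return genotype if support >= min_support else None


calls = {
    "algorithm_a": ("03:01", "02:01"),
    "algorithm_b": ("02:01", "03:01"),
    "algorithm_c": ("02:01", "02:05"),
}
print(consensus_genotype(calls))  # ('02:01', '03:01')
```

With this rule, ties are resolved in favor of the pair seen first. A real consensus step would probably need an explicit tie-breaking policy.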