Broadly speaking, the current system for determining mismatches can be thought of as two nested for-loops:
```python
import numpy as np

cur_min_hamming_distance = 10**5
cur_min_hamming_index = -1
for seq_under_test in seqs_under_test:
    for i, ref_seq in enumerate(ref_seqs):
        # XOR the binary-encoded sequences (see nuc2bin below);
        # any non-zero entry is a mismatching position
        hamming_distance = np.count_nonzero(nuc2bin(ref_seq) ^ nuc2bin(seq_under_test))
        if hamming_distance < cur_min_hamming_distance:
            cur_min_hamming_distance = hamming_distance
            cur_min_hamming_index = i
```
As experienced in actual use, this is a pretty slow process. We can take a few approaches to address it:
Parallelize computation
This system is embarrassingly parallel: the distance between a sequence under test and a reference sequence does not depend on any external process, and we are simply performing this calculation AxB times while looking for global minima.
We can partition the reference sequence set into smaller subsets (for argument's sake, let's say 500 sequences each) and then fire them off into asyncio TaskGroups, as in the sketch below.
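A minimal sketch of that chunking approach, assuming Python 3.11+ for `asyncio.TaskGroup` and a hypothetical `hamming_distance` helper wrapping the XOR comparison above:

```python
import asyncio

def min_distance_in_chunk(seq_under_test, chunk, offset):
    # Scan one chunk of reference sequences, tracking (distance, global index).
    best = (10**5, -1)
    for i, ref_seq in enumerate(chunk):
        d = hamming_distance(ref_seq, seq_under_test)  # hypothetical helper
        if d < best[0]:
            best = (d, offset + i)
    return best

async def closest_match(seq_under_test, ref_seqs, chunk_size=500):
    # Partition the reference set into chunks and score each one in a
    # worker thread; the TaskGroup waits for all chunks to finish.
    async with asyncio.TaskGroup() as tg:
        tasks = [
            tg.create_task(asyncio.to_thread(
                min_distance_in_chunk,
                seq_under_test, ref_seqs[o:o + chunk_size], o))
            for o in range(0, len(ref_seqs), chunk_size)
        ]
    # Reduce the per-chunk minima to the global minimum.
    return min(task.result() for task in tasks)
```

Note that since the distance computation is CPU-bound, `asyncio.to_thread` only buys real parallelism if the underlying work (e.g. NumPy array ops) releases the GIL; otherwise a `ProcessPoolExecutor` may be a better fit.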
Vectorize operations using Pandas/Polars
Once again exploiting the fact that the heavy computation is only necessary between two values, we can convert this into a tabular data format, with the final result looking something like:
| allele | ref_exon2 | ref_exon3 | ref_exon2_bin | ref_exon3_bin | seq0_exon2_distance | seq0_exon3_distance | .... | seqN_exon2_distance | seqN_exon3_distance |
|---|---|---|---|---|---|---|---|---|---|
| allele01 | ACGTN... | ACGTN.... | ndarray(1,2,4,8,15,...) | ndarray(1,2,4,8,15,...) | 0 | 0 | ... | 4 | 18 |
| .... | ... | ... | .... | .... | ... | ... | ... | ... | ... |
| alleleN | ACCTN... | ACCTN.... | ndarray(1,2,2,8,15,...) | ndarray(1,2,2,8,15,...) | 54 | 3 | ... | 5 | 22 |
Starting with our current alleles reference file, we would ingest records of the form {allele: list[str], exon2: str, exon3: str} into the dataframe, resulting in:
| allele | ref_exon2 | ref_exon3 |
|---|---|---|
| ["alleleA", "alleleB", ...] | ACGTN.... | ACGTN... |
We can then apply the nuc2bin method to the sequence columns:
```python
df["ref_exon2_bin"] = df["ref_exon2"].apply(nuc2bin)
df["ref_exon3_bin"] = df["ref_exon3"].apply(nuc2bin)
```
This can be followed by a `df.explode` on the allele column to convert the allele lists into individual rows (sketched below). Note that the explode step is optional, as it increases the number of computations for no real benefit.
We can then convert our string sequences under test to binary format using nuc2bin and do a vector comparison operation:
```python
import functools
from typing import Any

import numpy as np

# Note: functools.cache requires hashable arguments, so the cached values
# would need hashable wrappers (e.g. bytes) rather than raw ndarrays.
@functools.cache
def compare_nuc_bins(ref: Any, seq_under_test: Any) -> int:
    # Hamming distance: count positions where the binary encodings differ
    return int(np.count_nonzero(np.asarray(ref) ^ np.asarray(seq_under_test)))

for i, seq in enumerate(seqs_under_test):
    exon2 = nuc2bin(seq.exon2)
    exon3 = nuc2bin(seq.exon3)
    df[f"seq{i:04}_exon2_distance"] = df["ref_exon2_bin"].apply(compare_nuc_bins, seq_under_test=exon2)
    df[f"seq{i:04}_exon3_distance"] = df["ref_exon3_bin"].apply(compare_nuc_bins, seq_under_test=exon3)
```
We can then take the minimum with df.loc:
```python
closest_match_exon2 = df.loc[df[f"seq{i:04}_exon2_distance"] == df[f"seq{i:04}_exon2_distance"].min(), "allele"]
closest_match_exon3 = df.loc[df[f"seq{i:04}_exon3_distance"] == df[f"seq{i:04}_exon3_distance"].min(), "allele"]
```
IO operations
Because we are reading everything from a file and then computing the nuc2bin arrays at run time, it may be beneficial to cache the computed values to a file using Apache Parquet; pandas supports writing to Parquet out of the box. We should easily be able to detect whether we've already computed the values for a given data set. Please compare the time it takes to compute all nuc2bin arrays against the cost of reading in previously computed data.
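A minimal sketch of that caching flow, assuming a hypothetical `ingest_alleles` helper that builds the raw dataframe, a hypothetical cache path, and the cache file's mtime as the staleness check:

```python
from pathlib import Path

import pandas as pd

CACHE = Path("ref_alleles_bin.parquet")  # hypothetical cache location

def load_reference(alleles_path: Path) -> pd.DataFrame:
    # Reuse previously computed nuc2bin arrays when the cache is newer
    # than the source alleles file; otherwise recompute and persist.
    if CACHE.exists() and CACHE.stat().st_mtime >= alleles_path.stat().st_mtime:
        return pd.read_parquet(CACHE)
    df = ingest_alleles(alleles_path)  # hypothetical: builds the raw dataframe
    df["ref_exon2_bin"] = df["ref_exon2"].apply(nuc2bin)
    df["ref_exon3_bin"] = df["ref_exon3"].apply(nuc2bin)
    df.to_parquet(CACHE)  # requires a Parquet engine such as pyarrow
    return df
```

Timing both branches (e.g. with `time.perf_counter`) would answer the compute-vs-read question directly.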