
Vectorize computation operations #16

@va7eex

Description


Broadly speaking, the current system for determining mismatches can be thought of as two nested for-loops:

import numpy as np

cur_min_hamming_distance = 10**5
cur_min_hamming_index = -1

for seq_under_test in seqs_under_test:
    for i, ref_seq in enumerate(ref_seqs):
        # XOR the binary-encoded sequences and count mismatching positions
        hamming_distance = np.count_nonzero(np.asarray(ref_seq) ^ np.asarray(seq_under_test))
        if hamming_distance < cur_min_hamming_distance:
            cur_min_hamming_distance = hamming_distance
            cur_min_hamming_index = i

In actual use this has proven to be a pretty slow process. We can take a few approaches to address it:

Parallelize computation

This system is embarrassingly parallel: the distance between a sequence under test and a reference sequence is not dependent on any external process, and we are just performing this calculation A×B times looking for the global minimum.

We can partition the reference sequence set into smaller subsets (for argument's sake, let's say 500 sequences each) and then fire them off into asyncio TaskGroups, as sketched below.
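A minimal sketch of that partitioning, assuming Python 3.11+ for asyncio.TaskGroup and the XOR-based distance from above. Since the work is CPU-bound, each chunk is handed to asyncio.to_thread rather than run as a bare coroutine; the chunk size and helper names are illustrative:

import asyncio
import numpy as np

CHUNK_SIZE = 500  # the subset size suggested above

def closest_in_chunk(chunk: list[np.ndarray], seq: np.ndarray) -> tuple[int, int]:
    # Return (min_distance, index_within_chunk) for one reference subset.
    distances = [int(np.count_nonzero(ref ^ seq)) for ref in chunk]
    best = min(range(len(distances)), key=distances.__getitem__)
    return distances[best], best

async def closest_match(ref_seqs: list[np.ndarray], seq: np.ndarray) -> tuple[int, int]:
    chunks = [ref_seqs[i:i + CHUNK_SIZE] for i in range(0, len(ref_seqs), CHUNK_SIZE)]
    async with asyncio.TaskGroup() as tg:
        # CPU-bound work, so push each chunk onto a worker thread;
        # plain coroutines would all run serially on the event-loop thread.
        tasks = [tg.create_task(asyncio.to_thread(closest_in_chunk, c, seq)) for c in chunks]
    results = [t.result() for t in tasks]
    best = min(range(len(results)), key=lambda j: results[j][0])
    return results[best][0], best * CHUNK_SIZE + results[best][1]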

Vectorize operations using Pandas/Polars

Once again exploiting the fact that the heavy computation is only needed between pairs of values, we can convert this to a tabular data format, with the final result looking something like:

allele    ref_exon2  ref_exon3  ref_exon2_bin            ref_exon3_bin            seq0_exon2_distance  seq0_exon3_distance  ...  seqN_exon2_distance  seqN_exon3_distance
allele01  ACGTN...   ACGTN...   ndarray(1,2,4,8,15,...)  ndarray(1,2,4,8,15,...)  0                    0                    ...  4                    18
...       ...        ...        ...                      ...                      ...                  ...                  ...  ...                  ...
alleleN   ACCTN...   ACCTN...   ndarray(1,2,2,8,15,...)  ndarray(1,2,2,8,15,...)  54                   3                    ...  5                    22

Starting with our current alleles reference file, we would ingest the {allele: list[str], exon2: str, exon3: str} records into the dataframe, resulting in:

allele                       ref_exon2  ref_exon3
["alleleA", "alleleB", ...]  ACGTN...   ACGTN...

We can then apply the nuc2bin method to the sequence columns:

df["ref_exon2_bin"] = df["ref_exon2"].apply(nuc2bin)
df["ref_exon3_bin"] = df["ref_exon3"].apply(nuc2bin)

This is followed by a df.explode on the allele column to break the allele lists out into individual rows (see the one-liner below). Note that explode is optional, as it increases the number of computations for no real benefit.
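If we do want one row per allele name, that step is a single call:

df = df.explode("allele")  # one row per allele name; sequence columns are duplicated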

We can then convert our string sequences under test to binary format using nuc2bin and do a vector comparison operation:

import functools
import numpy as np

@functools.cache
def compare_nuc_bins(ref: bytes, seq_under_test: bytes) -> int:
    # functools.cache needs hashable arguments, so the uint8 arrays are
    # passed as bytes; sequences are assumed to be equal length (aligned).
    ref_arr = np.frombuffer(ref, dtype=np.uint8)
    seq_arr = np.frombuffer(seq_under_test, dtype=np.uint8)
    return int(np.count_nonzero(ref_arr ^ seq_arr))

for i, seq in enumerate(seqs_under_test):
    exon2 = nuc2bin(seq.exon2).tobytes()
    exon3 = nuc2bin(seq.exon3).tobytes()
    df[f"seq{i:04}_exon2_distance"] = df["ref_exon2_bin"].apply(
        lambda ref: compare_nuc_bins(ref.tobytes(), exon2)
    )
    df[f"seq{i:04}_exon3_distance"] = df["ref_exon3_bin"].apply(
        lambda ref: compare_nuc_bins(ref.tobytes(), exon3)
    )

We can then take the minimum with df.loc:

closest_match_exon2 = df.loc[df[f"seq{i:04}_exon2_distance"] == df[f"seq{i:04}_exon2_distance"].min(), "allele"]
closest_match_exon3 = df.loc[df[f"seq{i:04}_exon3_distance"] == df[f"seq{i:04}_exon3_distance"].min(), "allele"]
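Note that the boolean-mask form returns every allele tied for the minimum. If a single best match is enough, idxmin is a slightly terser equivalent (it returns the first row at the minimum):

closest_match_exon2 = df.loc[df[f"seq{i:04}_exon2_distance"].idxmin(), "allele"]
closest_match_exon3 = df.loc[df[f"seq{i:04}_exon3_distance"].idxmin(), "allele"]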

IO operations

Because we are reading everything from a file and then computing nuc2bin at run time, it may be beneficial to cache the computed values to a file using Apache Parquet; pandas supports writing to Parquet out of the box. We should easily be able to detect whether we've already computed the values for a given data set. Please compare the time it takes to compute all nuc2bin arrays against the cost of reading in previously computed data.
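A minimal caching sketch, assuming pyarrow (or fastparquet) is installed for pandas' Parquet support; the cache path and load_reference_dataframe loader are placeholders for the ingest step above:

from pathlib import Path
import pandas as pd

cache = Path("alleles_ref_bins.parquet")  # hypothetical cache location

if cache.exists():
    # Reuse previously computed binary arrays instead of re-encoding.
    df = pd.read_parquet(cache)
else:
    df = load_reference_dataframe()  # placeholder for the existing ingest step
    df["ref_exon2_bin"] = df["ref_exon2"].apply(nuc2bin)
    df["ref_exon3_bin"] = df["ref_exon3"].apply(nuc2bin)
    df.to_parquet(cache)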
