Broadly speaking, the current system for determining mismatches can be thought of as two nested for-loops:
```python
import numpy as np

cur_min_hamming_distance = 10**5
cur_min_hamming_index = -1
for seq_under_test in seqs_under_test:
    for i, ref_seq in enumerate(ref_seqs):
        # XOR the binary-encoded sequences (see nuc2bin below);
        # any non-zero entry is a mismatching position
        hamming_distance = np.count_nonzero(nuc2bin(ref_seq) ^ nuc2bin(seq_under_test))
        if hamming_distance < cur_min_hamming_distance:
            cur_min_hamming_distance = hamming_distance
            cur_min_hamming_index = i
```
As experienced in actual use, this is a pretty slow process. We can take a few approaches to address it:
Parallelize computation
This system is embarrassingly parallel: the distance between a sequence under test and a reference sequence does not depend on any external process, and we are simply performing this calculation AxB times while looking for global minima.
We can partition the reference sequence set into smaller subsets (for argument's sake, let's say 500 sequences each) and then fire them off into asyncio TaskGroups, as in the sketch below.
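A minimal sketch of that chunking approach, assuming Python 3.11+ for `asyncio.TaskGroup` and a hypothetical `hamming_distance` helper wrapping the XOR comparison above:

```python
import asyncio

def min_distance_in_chunk(seq_under_test, chunk, offset):
    # Scan one chunk of reference sequences, tracking (distance, global index).
    best = (10**5, -1)
    for i, ref_seq in enumerate(chunk):
        d = hamming_distance(ref_seq, seq_under_test)  # hypothetical helper
        if d < best[0]:
            best = (d, offset + i)
    return best

async def closest_match(seq_under_test, ref_seqs, chunk_size=500):
    # Partition the reference set into chunks and score each one in a
    # worker thread; the TaskGroup waits for all chunks to finish.
    async with asyncio.TaskGroup() as tg:
        tasks = [
            tg.create_task(asyncio.to_thread(
                min_distance_in_chunk,
                seq_under_test, ref_seqs[o:o + chunk_size], o))
            for o in range(0, len(ref_seqs), chunk_size)
        ]
    # Reduce the per-chunk minima to the global minimum.
    return min(task.result() for task in tasks)
```

Note that since the distance computation is CPU-bound, `asyncio.to_thread` only buys real parallelism if the underlying work (e.g. NumPy array ops) releases the GIL; otherwise a `ProcessPoolExecutor` may be a better fit.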
Vectorize operations using Pandas/Polars
Once again exploiting the fact that the heavy computation is only necessary between two values, we can convert this into a tabular data format, with the final result looking something like:
| allele | ref_exon2 | ref_exon3 | ref_exon2_bin | ref_exon3_bin | seq0_exon2_distance | seq0_exon3_distance | .... | seqN_exon2_distance | seqN_exon3_distance |
|---|---|---|---|---|---|---|---|---|---|
| allele01 | ACGTN... | ACGTN.... | ndarray(1,2,4,8,15,...) | ndarray(1,2,4,8,15,...) | 0 | 0 | ... | 4 | 18 |
| .... | ... | ... | .... | .... | ... | ... | ... | ... | ... |
| alleleN | ACCTN... | ACCTN.... | ndarray(1,2,2,8,15,...) | ndarray(1,2,2,8,15,...) | 54 | 3 | ... | 5 | 22 |
Starting with our current alleles reference file, we would ingest records of the form {allele: list[str], exon2: str, exon3: str} into the dataframe, resulting in:
| allele | ref_exon2 | ref_exon3 |
|---|---|---|
| ["alleleA", "alleleB", ...] | ACGTN.... | ACGTN... |
We can then apply the nuc2bin method to the sequence columns:
```python
df["ref_exon2_bin"] = df["ref_exon2"].apply(nuc2bin)
df["ref_exon3_bin"] = df["ref_exon3"].apply(nuc2bin)
```
This can be followed by a `df.explode` on the allele column to convert the allele lists into individual rows (sketched below). Note that the explode step is optional, as it increases the number of computations for no real benefit.
We can then convert our string sequences under test to binary format using nuc2bin and do a vector comparison operation:
```python
import functools
from typing import Any

import numpy as np

# Note: functools.cache requires hashable arguments, so the cached values
# would need hashable wrappers (e.g. bytes) rather than raw ndarrays.
@functools.cache
def compare_nuc_bins(ref: Any, seq_under_test: Any) -> int:
    # Hamming distance: count positions where the binary encodings differ
    return int(np.count_nonzero(np.asarray(ref) ^ np.asarray(seq_under_test)))

for i, seq in enumerate(seqs_under_test):
    exon2 = nuc2bin(seq.exon2)
    exon3 = nuc2bin(seq.exon3)
    df[f"seq{i:04}_exon2_distance"] = df["ref_exon2_bin"].apply(compare_nuc_bins, seq_under_test=exon2)
    df[f"seq{i:04}_exon3_distance"] = df["ref_exon3_bin"].apply(compare_nuc_bins, seq_under_test=exon3)
```
We can then take the minimum with df.loc:
```python
closest_match_exon2 = df.loc[df[f"seq{i:04}_exon2_distance"] == df[f"seq{i:04}_exon2_distance"].min(), "allele"]
closest_match_exon3 = df.loc[df[f"seq{i:04}_exon3_distance"] == df[f"seq{i:04}_exon3_distance"].min(), "allele"]
```
IO operations
Because we are reading everything from a file and then computing the nuc2bin arrays at run time, it may be beneficial to cache the computed values to a file using Apache Parquet; pandas supports writing to Parquet out of the box. We should easily be able to detect whether we've already computed the values for a given data set. Please compare the time it takes to compute all nuc2bin arrays against the cost of reading in previously computed data.
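A minimal sketch of that caching flow, assuming a hypothetical `ingest_alleles` helper that builds the raw dataframe, a hypothetical cache path, and the cache file's mtime as the staleness check:

```python
from pathlib import Path

import pandas as pd

CACHE = Path("ref_alleles_bin.parquet")  # hypothetical cache location

def load_reference(alleles_path: Path) -> pd.DataFrame:
    # Reuse previously computed nuc2bin arrays when the cache is newer
    # than the source alleles file; otherwise recompute and persist.
    if CACHE.exists() and CACHE.stat().st_mtime >= alleles_path.stat().st_mtime:
        return pd.read_parquet(CACHE)
    df = ingest_alleles(alleles_path)  # hypothetical: builds the raw dataframe
    df["ref_exon2_bin"] = df["ref_exon2"].apply(nuc2bin)
    df["ref_exon3_bin"] = df["ref_exon3"].apply(nuc2bin)
    df.to_parquet(CACHE)  # requires a Parquet engine such as pyarrow
    return df
```

Timing both branches (e.g. with `time.perf_counter`) would answer the compute-vs-read question directly.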