
nested multiprocessing in notebooks with mpire #29

@sergpolly

Consider demonstrating parallel execution of some of the cooltools API functions across multiple samples - i.e. the case where an API function itself uses multiprocessing and we want to run it for several samples at once in a notebook ...

If one has a big multicore system (16 real cores or more), it is easy to run several CLI tasks in parallel for multiple samples, where each task itself uses several cores - i.e. is multiprocessed. Very often such multiprocessed operations do not scale well beyond 8-12 processes - so it is indeed more "economical" to process multiple samples at once with fewer cores each.

Now - what if we want to achieve the same in a notebook? It is not trivial, because multiprocess does not allow nesting (the way we typically use it, out of the box). It can be done easily with MPIRE https://github.com/sybrenjansen/mpire , which allows running multiple multiprocessed tasks in parallel and whose API is very similar to that of multiprocess ... Check it out:

mpire test:

import cooltools
from mpire import WorkerPool

# clrs - dictionary of several coolers, keyed by sample name (defined earlier)
# hg38_arms - view dataframe passed as view_df (defined earlier)
exp_kwargs = dict(view_df=hg38_arms, nproc=12)

def _job(sample):
    # compute cis-expected for one sample, using nproc processes internally
    _clr = clrs[sample]
    _exp = cooltools.expected_cis(_clr, **exp_kwargs)
    return (sample, _exp)

# have to use daemon=False, because _job is multiprocessing-based already ...
# trying to run 8 samples in parallel, each using 12 processes - 8*12=96
with WorkerPool(n_jobs=8, daemon=False) as wpool:
    # iterating over the clrs dictionary yields the sample names
    results = wpool.map(_job, clrs, progress_bar=True)

# sort out the results ...
exps = {sample: _exp for sample, _exp in results}

# this takes ~1 min for 16 coolers @ 25kb on a 56-core system (112 threads)
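
The daemon=False requirement can be illustrated with a small sketch. It uses the standard-library multiprocessing rather than multiprocess or mpire, and it assumes the "fork" start method (Linux); the function names are hypothetical. Pool workers are daemonic by default, and a daemonic process is not allowed to start children, which is why a nested pool fails unless the outer process is non-daemonic:

```python
import multiprocessing as mp


def _try_nested_pool(conn):
    """Runs inside a child process: try to open a pool and report the outcome."""
    ctx = mp.get_context("fork")  # assumption: fork is available (Linux)
    try:
        with ctx.Pool(2):
            outcome = "nested pool started"
    except AssertionError as err:
        # raised when a daemonic process tries to start child processes
        outcome = f"refused: {err}"
    conn.send(outcome)
    conn.close()


def nested_pool_outcome(daemon):
    """Start one child process (daemonic or not) and return what it reported."""
    ctx = mp.get_context("fork")
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_try_nested_pool, args=(send_conn,), daemon=daemon)
    proc.start()
    outcome = recv_conn.recv()
    proc.join()
    return outcome


if __name__ == "__main__":
    print("daemon=True :", nested_pool_outcome(True))
    print("daemon=False:", nested_pool_outcome(False))
```

The daemonic child should report a refusal, while the non-daemonic one can open its own pool; this is presumably the same restriction that daemon=False lifts for the mpire workers.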

one-by-one using a ton of cores per task:

exp_kwargs = dict(view_df=hg38_arms, nproc=112)
exps = {}
# process the samples sequentially, giving all 112 threads to each task
for sample, _clr in clrs.items():
    print(f"calculating expected for {sample} ...")
    exps[sample] = cooltools.expected_cis(_clr, **exp_kwargs)

# this takes > 2 mins, and shows no time improvement beyond nproc=32 ...
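
The difference between the two runs comes down to how the core budget is split between samples and processes per sample. A minimal sketch of that arithmetic follows; plan_core_split is a hypothetical helper, not part of cooltools or mpire, and the cap on processes per task is whatever value stops scaling on a given machine:

```python
def plan_core_split(total_cores, n_samples, max_nproc_per_task):
    """Hypothetical helper: split a core budget into (n_jobs, nproc).

    nproc is capped at the point where a single task stops scaling,
    and n_jobs is chosen so that n_jobs * nproc fits in total_cores.
    """
    if min(total_cores, n_samples, max_nproc_per_task) < 1:
        raise ValueError("all arguments must be positive integers")
    nproc = min(max_nproc_per_task, total_cores)
    n_jobs = min(n_samples, total_cores // nproc)
    return n_jobs, nproc


if __name__ == "__main__":
    # 112 threads, 16 samples, tasks capped at 12 processes each
    n_jobs, nproc = plan_core_split(112, 16, 12)
    print(f"n_jobs={n_jobs}, nproc={nproc}, total={n_jobs * nproc}")
```

The resulting pair would map onto WorkerPool(n_jobs=...) and the nproc entry of exp_kwargs; leaving a few cores unassigned, as in the 8*12=96 run, is a reasonable choice too.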

This has limited application - it is relevant for projects with many samples and for people with big workstations - but when both criteria are met, the speed-up is much appreciated.
