Migration of gene_sets (formerly target_facets) computation from Scala to PySpark #50

Open: polrus wants to merge 1 commit into main from feature/facets-migration

@polrus polrus commented Dec 1, 2025

"Facets" are renamed to "gene_sets" to better reflect their meaning as collections of gene identifiers grouped by biological attributes.

New PySpark implementation in src/pts/pyspark/gene_sets/:

  • gene_sets.py - Main computation logic
  • helpers.py - Utility functions
  • propagation/ - Entity ID propagation algorithms

New feature: Implements transitive propagation of entity IDs from child terms to parent terms in hierarchical ontologies (GO, Reactome, ChEMBL Target Class).

The propagation function processes leaf nodes (terms with no children) first, merges their entityIds into their parents, removes the processed edges, and repeats until no edges remain.
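As an illustration of the leaf-first loop described above, here is a minimal plain-Python sketch. It is not the PySpark code in this PR: the function name `propagate_entity_ids`, the `(child, parent)` edge tuples, and the dict-of-sets input are assumptions made for the example, and the cycle check is an addition that the description does not mention.

```python
from collections import defaultdict


def propagate_entity_ids(edges, entity_ids):
    """Propagate entity IDs from child terms up to all of their ancestors.

    edges: iterable of (child, parent) pairs (hypothetical input shape).
    entity_ids: dict mapping a term to the set of IDs annotated directly on it.
    """
    # Copy the input so the caller's sets are not mutated.
    result = defaultdict(set)
    for term, ids in entity_ids.items():
        result[term] |= set(ids)
    remaining = set(edges)

    while remaining:
        parents = {parent for _, parent in remaining}
        # A leaf is a child that is not the parent of any unprocessed edge,
        # so its ID set is already complete.
        leaf_edges = {(c, p) for c, p in remaining if c not in parents}
        if not leaf_edges:
            # Assumption: the ontology is a DAG; fail loudly otherwise.
            raise ValueError("cycle detected in ontology edges")
        for child, parent in leaf_edges:
            result[parent] |= result[child]
        # Drop the processed edges and repeat with the new leaves.
        remaining -= leaf_edges

    return dict(result)
```

Each pass handles one "layer" of the hierarchy, so the number of iterations is bounded by the depth of the ontology rather than by the number of terms.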
Because each iteration adds to Spark’s lazy lineage, long iterative runs can trigger stack overflows. The Session class now sets up a checkpoint directory (from spark.checkpoint.dir or a temporary folder) so iterative algorithms like entityId propagation can checkpoint intermediate results. Checkpointing writes DataFrames to disk and truncates their lineage, which keeps the execution plan from growing without bound.
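The checkpoint setup might look roughly like the sketch below. This is not the actual Session class: the helper names, the plain-dict `conf`, and the `every` parameter are hypothetical. The only Spark calls used are `SparkContext.setCheckpointDir` and `DataFrame.checkpoint`, invoked on objects passed in by the caller.

```python
import tempfile


def resolve_checkpoint_dir(conf):
    """Return the configured checkpoint directory, or a fresh temporary one.

    conf: dict-like settings (hypothetical); the key mirrors spark.checkpoint.dir.
    """
    configured = conf.get("spark.checkpoint.dir")
    if configured:
        return configured
    return tempfile.mkdtemp(prefix="spark-checkpoint-")


def configure_checkpointing(spark, conf):
    """Register the checkpoint directory on an existing SparkSession."""
    checkpoint_dir = resolve_checkpoint_dir(conf)
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    return checkpoint_dir


def iterate_with_checkpoints(df, step, n_iterations, every=5):
    """Apply `step` repeatedly, checkpointing every `every` iterations.

    df.checkpoint() is eager by default: it materializes the DataFrame to the
    checkpoint directory and returns a new DataFrame with truncated lineage.
    """
    for i in range(1, n_iterations + 1):
        df = step(df)
        if i % every == 0:
            df = df.checkpoint()
    return df
```

Checkpointing on every iteration would be the safest option for lineage depth but adds disk I/O on each pass, so the interval is a trade-off. The value used in this PR is not stated in the description.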

Testing: src/test/test_gene_sets/ (execution time ~2 minutes):

  • test_prepare_dataset.py - Tests dataset preparation for propagation (7 tests)
  • test_merge_propagated.py - Tests merging propagated results (6 tests)
  • test_propagate_with_prep.py - Tests end-to-end propagation (3 tests)

@polrus polrus changed the title from "Migration of **gene sets (formerly "facets")** computation from Scala to PySpark" to "Migration of gene_sets (formerly target_facets) computation from Scala to PySpark" Dec 1, 2025
@polrus polrus requested a review from javfg December 3, 2025 15:52
@javfg javfg force-pushed the feature/facets-migration branch from fb730e6 to cd17afe Compare February 24, 2026 14:26
Co-authored-by: Polina Rusina <polina@ebi.ac.uk>
@javfg javfg force-pushed the feature/facets-migration branch from cd17afe to e3a8ab1 Compare February 24, 2026 14:41