SMSD 6.11.0 — Algorithms & Design

Author: Syed Asad Rahman, BioInception PVT LTD Version: 6.11.0 | Toolkit: CDK 2.11 | Language: Java 21+ / C++17 / Python 3.9+ (recommended 3.12)

Abstract

SMSD provides exact substructure and maximum common substructure (MCS) search for chemical graphs using:

VF2++ backtracking with NLF pruning and bit-parallel domain filtering for substructure search
Modular product + maximum clique (Bron-Kerbosch with RRSplit pivoting) for MCS
RASCAL screening for fast similarity upper-bound filtering
FASTiso ordering for accelerated candidate selection during backtracking
Greedy probing to seed and bound clique search
McGregor extension to grow clique seeds into larger common subgraphs
MolGraph precomputation eliminating all CDK hot-path overhead
Canonical labeling via Bliss-style individualization-refinement
Canonical SMILES generation (deterministic, CDK-free)
Orbit-based automorphism for atom equivalence classes
Recursive SMARTS predicates with named registry
Stereo-aware matching (E/Z double bonds, tetrahedral chirality)

1. Background

The original SMSD toolkit (Rahman et al., J. Cheminformatics, 2009) established a multi-strategy approach to MCS by integrating VF-family graph matching with CDK routines. Version 3.0 is a ground-up rewrite that retains the algorithmic foundations while modernizing the implementation for performance and correctness on CDK 2.11.

Key algorithmic references:

VF2/VF2++: Cordella et al. (2004); Juttner & Madarasi (2018)
Maximum clique: Bron & Kerbosch (1973); Tomita et al. (2006)
RRSplit pivoting: Sun & Luo (2025)
RASCAL screening: Raymond & Willett (2002)
FASTiso ordering: Vu & Ahn (2025)
McGregor extension: McGregor (1982)
CDK: Willighagen et al., J. Cheminformatics (2017)

2. MolGraph Precomputation

All CDK object access is performed once at construction time. The MolGraph class pre-builds:

Data	Type	Purpose
`atomicNum[]`	`int[]`	Atomic number per atom
`formalCharge[]`	`int[]`	Formal charge per atom
`label[]`	`int[]`	Encoded label: `(Z<<2)\|(aromatic<<1)\|ring`
`ring[]`, `aromatic[]`	`boolean[]`	Ring/aromaticity flags
`degree[]`	`int[]`	Vertex degree
`neighbors[][]`	`int[][]`	Adjacency lists (primitive arrays)
`bonds[][]`	`IBond[][]`	O(1) bond lookup matrix
`adj[]`	`BitSet[]`	Adjacency bitsets for pruning

All subsequent operations (NLF building, feasibility checks, compatibility tests, MCS seed construction) operate on these primitive arrays with zero CDK method calls in inner loops.

3. Chemistry Constraints

ChemOptions governs matching behaviour:

Atom matching: element type, formal charge, aromaticity mode (strict/flexible), ring-only constraint
Bond matching: strict/loose/any bond order, ring-size guard, geometric (E/Z) stereo via DoubleBondStereochemistry
Chirality: tetrahedral stereo via ITetrahedralChirality
Pruning: 1/2/3-hop NLF (neighbor label frequency), bit-parallel adjacency feasibility

Constraints are checked incrementally during VF2++ backtracking.

4. Substructure Search (VF2++)

The VF2PPMatcher implements VF2++ with enhancements:

Atom ordering: BFS from highest-scoring atom (aromatic > ring > degree) with degree-sorted neighbor expansion
Frontier pruning: target candidates restricted to frontier set of already-matched neighbors
Multi-level NLF: 1-hop (direct neighbors), 2-hop, and 3-hop label frequency comparison eliminates incompatible candidates early
Bit-parallel feasibility: BitSet adjacency subset test
O(1) bond lookup: pre-built IBond[][] matrix
Reusable scratch: BitSets cleared and reused instead of allocated

The VF2Matcher (baseline) uses simpler neighbor-based candidate selection without frontier tracking.

5. MCS via Modular Product + Maximum Clique

Modular Product Construction

Each node represents a compatible (query_atom, target_atom) pair. Edges connect nodes whose pairwise bond relationships are consistent:

MCCS (connected): edges only when both pairs have compatible bonds
MCIS (induced): edges also when neither pair has a bond

Maximum Clique (Bron-Kerbosch with RRSplit)

The clique finder uses Bron-Kerbosch with RRSplit pivoting (Sun & Luo, 2025), which applies a reduced-recursion split strategy to minimize redundant branching. This is combined with Tomita-style pivoting and degree-ordered start vertices. Multiple seed strategies run in parallel under time budget:

Clique seed: modular product maximum clique
Ring anchor seed: match ring atoms by aromatic/degree score
Label-degree anchor seed: match by rarest label, highest degree
VF2++ ring skeleton seed: substructure match on ring skeletons
VF2++ core seed: substructure match on heavy-atom core

McGregor Extension

Each seed is extended by bounded BFS augmentation with NLF filtering. Best result across all seeds is retained. Post-processing applies:

Induced subgraph pruning (if requested)
Largest connected component extraction
Ring-anchor guard (prevents off-ring inflation)

6. Optimizations (v5.2 onwards)

6.1 Bit-Parallel Domain Filtering

Candidate domains for each query atom are represented as BitSet vectors. Adjacency constraints are enforced via bitwise AND operations, pruning impossible assignments before backtracking begins. This eliminates large portions of the search space in O(n/w) time per constraint check, where w is the machine word size.

6.2 Bron-Kerbosch with RRSplit Pivoting

The RRSplit strategy (Sun & Luo, SIGMOD 2025) partitions the candidate set at each recursion level to reduce redundant subtree exploration. By splitting candidates into reduced and recursive subsets, this approach achieves tighter pruning than standard Tomita pivoting, particularly on dense modular product graphs common in MCS computation.

6.3 RASCAL Screening

Before committing to full MCS computation, SMSD applies RASCAL-based upper-bound estimation (Raymond & Willett, 2002). This computes a fast similarity ceiling from the modular product graph structure. Pairs that cannot exceed the current best similarity are skipped entirely, providing significant speedups on batch MCS workflows. Exposed via the SearchEngine.similarityUpperBound() API.

6.4 FASTiso Ordering

The FASTiso heuristic (Vu & Ahn, 2025) provides an improved candidate ordering for subgraph isomorphism. It assigns priorities based on neighborhood label diversity and connectivity patterns, producing orderings that lead to earlier pruning and fewer backtracks compared to simple degree-based orderings.

6.5 Greedy Probing

Before launching the full Bron-Kerbosch search, a greedy probing phase constructs an initial clique by iteratively selecting the highest-degree compatible vertex. This provides a non-trivial lower bound that prunes unproductive branches early in the exact search, and serves as a fallback result if the time budget expires.

6.6 Canonical Labeling (Bliss-Style Individualization-Refinement)

SMSD computes a canonical labeling for each MolGraph using a Bliss-style individualization-refinement algorithm (McKay & Piperno, 2014). The procedure:

Color refinement: iteratively partition atoms by their neighborhood label multisets until the partition stabilises (equitable colouring).
Individualization: when the partition is not discrete, select the first non-singleton cell, individualize each atom in it, and recurse.
Canonical leaf selection: among all discrete partitions reached, select the lexicographically smallest as the canonical form.
Automorphism pruning: detect automorphisms during the search tree traversal and prune equivalent branches.

The result is a canonical relabeling canonicalLabel[i] giving the canonical index for each atom i, and a canonicalHash (64-bit) derived from the canonical adjacency encoding. Two molecules are isomorphic if and only if their canonical hashes match, enabling O(1) graph isomorphism checks.

Exposed via MolGraph.getCanonicalLabeling() and MolGraph.getCanonicalHash().

6.7 Canonical SMILES Generation

MolGraph.toCanonicalSmiles() produces a deterministic canonical SMILES string directly from the MolGraph representation, with no CDK dependency. The algorithm:

Applies the canonical labeling to determine atom visit order.
Performs a DFS traversal in canonical order, emitting atoms, branches (parentheses), and ring closures.
Handles disconnected components via dot-separated notation.
Encodes aromaticity (lowercase letters), charges, and hydrogen counts.

Because the traversal follows the canonical labeling, the output is identical for all isomorphic input representations.

6.8 Orbit Computation and Automorphism Pruning

During canonical labeling, the algorithm accumulates automorphism generators. Each generator permutes atom indices while preserving adjacency and labels. The orbit of an atom is the set of all atoms reachable from it under the automorphism group.

MolGraph.getOrbits() returns an array where orbit[i] is the smallest atom index in the equivalence class of atom i. Atoms with the same orbit value are symmetrically equivalent. This information is useful for:

Symmetry-aware enumeration: avoid redundant atom mappings
Reaction site prediction: identify equivalent functional groups
Canonicalization verification: atoms in the same orbit are interchangeable

7. Public API

SMSD provides three API tiers, each with different trade-offs:

Tier 1: SMSD facade (convenience, auto-standardises)

new SMSD(query, target, opts) — standardises molecules automatically
new SMSD(smiles1, smiles2, opts) — parses SMILES + standardises
smsd.isSubstructure(), smsd.findMCS(), smsd.similarityUpperBound()

Tier 2: SMSD facade (pre-standardised, skip preprocessing)

new SMSD(query, target, opts, false) — molecules used as-is
Same API as Tier 1, but 2-5x faster construction

Tier 3: SearchEngine (zero overhead, full control)

SearchEngine.isSubstructure(q, t, opts, timeoutMs) — substructure
SearchEngine.findMCS(q, t, opts, mcsOpts) — MCS with full options
SearchEngine.similarityUpperBound(q, t, opts) — RASCAL screening
SearchEngine.batchMCS(molecules, threshold, mcsOpts) — batch MCS
SearchEngine.findDisconnectedMCS(q, t, opts, mcsOpts) — disconnected MCS
No standardisation, no object creation overhead

Standardisation (Standardiser.standardise()) is a CDK preprocessing step independent of the matching algorithms. Users with their own preprocessing pipelines should use Tier 2 or 3.

Tier 4: MolGraph.Builder (CDK-free, toolkit-agnostic)

MolGraph.Builder constructs molecules from raw primitive arrays
No CDK dependency required at runtime
Enables adapters for OpenBabel, RDKit, Schrödinger, or any toolkit
SearchEngine.isSubstructure(MolGraph, MolGraph, opts, timeout)
SearchEngine.findMCS(MolGraph, MolGraph, opts, mcsOpts)

MolGraph Canonical Operations

MolGraph.getCanonicalLabeling() — canonical atom permutation (Bliss-style)
MolGraph.getCanonicalHash() — 64-bit canonical hash for O(1) isomorphism
MolGraph.toCanonicalSmiles() — deterministic canonical SMILES (CDK-free)
MolGraph.getOrbits() — automorphism orbit equivalence classes

Architecture: CDK is one adapter via new MolGraph(IAtomContainer). Users can write their own adapter using MolGraph.Builder for any toolkit.

8. SMARTS Support

Standardiser.matchAll(smarts, target) matches SMARTS via CDK's SmartsPattern (direct API, no reflection)
SmartPredicateProcessor expands named predicates ($isKetone, $isAmideN, etc.) from PredicateRegistry before matching
Recursive $(SMARTS) blocks are anchored to candidate atoms

9. Performance

Benchmarks (best-of-500 substructure / best-of-100 MCS, JIT-warmed):

Pair	Substructure	MCS
Benzene / Naphthalene	97 us	471 us
Aspirin / Acetaminophen	90 us	608 us
Morphine / Codeine	153 us	1,745 us
C15 / C20 chain	104 us	1,286 us
Triphenylene / Chrysene	191 us	2,496 us

Both substructure and MCS are NP-hard in the worst case. Practical performance is achieved through strong NLF pruning, bit-parallel feasibility, RASCAL screening, FASTiso ordering, greedy probing, RRSplit pivoting, frontier-based candidate selection, and time budgeting.

10. Test Suite

The project includes 420 JUnit 5 tests covering heterocycles, reactions, drug pairs, stereo configurations, SMARTS predicates, batch workflows, adversarial edge cases, large molecules (100-1000+ atoms), nucleotide MCS pairs, and Kekule/aromatic equivalence across all supported molecule formats.

11. References

Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. J. Cheminformatics, 1:12, 2009.
Willighagen EL, Mayfield JW, Alvarsson J, et al. The Chemistry Development Kit (CDK) v2.0. J. Cheminformatics, 9:33, 2017.
Cordella LP, Foggia P, Sansone C, Vento M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE TPAMI, 26(10):1367-1372, 2004.
Tomita E, Tanaka A, Takahashi H. The worst-case time complexity for generating all maximal cliques. Theoretical Computer Science, 363(1):28-42, 2006.
McGregor JJ. Backtrack search algorithms and the maximal common subgraph problem. Software Practice and Experience, 12(1):23-34, 1982.
Raymond JW, Willett P. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J. Computer-Aided Molecular Design, 16(7):521-533, 2002.
Sun S, Luo S. RRSplit: Efficient Maximum Clique Finding via Reduced-Recursion Splitting. Proc. ACM SIGMOD, 2025.
Vu XT, Ahn HK. FASTiso: A Faster Algorithm for Subgraph Isomorphism. Proc. VLDB Endowment, 2025.
Sun S, Luo S. SIGMo: Subgraph Isomorphism on GPU via Maximum Clique. Proc. SC (Supercomputing), 2025.
McKay BD, Piperno A. Practical Graph Isomorphism, II. J. Symbolic Computation, 60:94-112, 2014. (Nauty/Bliss canonical labeling foundations)
Thakkar A, et al. SynKit: A Comprehensive Toolkit for Synthesis Planning. J. Chemical Information and Modeling (JCIM), 2025.
Thakkar A, et al. Rxn-INSIGHT: Fast Chemical Reaction Analysis Using Bond-Electron Matrices. J. Cheminformatics, 16:37, 2024.
Zhong Z, et al. FlowER: Flow-based Enzymatic Retrosynthesis. arXiv preprint, 2025.
Zhong Z, et al. SynRXN: Synthesizability-Aware Retrosynthesis with Reaction Networks. arXiv preprint, 2026.
McKay BD, Piperno A. Nauty and Traces, version 2.9.3. 2026. https://pallini.di.uniroma1.it/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SMSD 6.11.0 — Algorithms & Design

Abstract

1. Background

2. MolGraph Precomputation

3. Chemistry Constraints

4. Substructure Search (VF2++)

5. MCS via Modular Product + Maximum Clique

Modular Product Construction

Maximum Clique (Bron-Kerbosch with RRSplit)

McGregor Extension

6. Optimizations (v5.2 onwards)

6.1 Bit-Parallel Domain Filtering

6.2 Bron-Kerbosch with RRSplit Pivoting

6.3 RASCAL Screening

6.4 FASTiso Ordering

6.5 Greedy Probing

6.6 Canonical Labeling (Bliss-Style Individualization-Refinement)

6.7 Canonical SMILES Generation

6.8 Orbit Computation and Automorphism Pruning

7. Public API

Tier 1: SMSD facade (convenience, auto-standardises)

Tier 2: SMSD facade (pre-standardised, skip preprocessing)

Tier 3: SearchEngine (zero overhead, full control)

Tier 4: MolGraph.Builder (CDK-free, toolkit-agnostic)

MolGraph Canonical Operations

8. SMARTS Support

9. Performance

10. Test Suite

11. References

FilesExpand file tree

WHITEPAPER.md

Latest commit

History

WHITEPAPER.md

File metadata and controls

SMSD 6.11.0 — Algorithms & Design

Abstract

1. Background

2. MolGraph Precomputation

3. Chemistry Constraints

4. Substructure Search (VF2++)

5. MCS via Modular Product + Maximum Clique

Modular Product Construction

Maximum Clique (Bron-Kerbosch with RRSplit)

McGregor Extension

6. Optimizations (v5.2 onwards)

6.1 Bit-Parallel Domain Filtering

6.2 Bron-Kerbosch with RRSplit Pivoting

6.3 RASCAL Screening

6.4 FASTiso Ordering

6.5 Greedy Probing

6.6 Canonical Labeling (Bliss-Style Individualization-Refinement)

6.7 Canonical SMILES Generation

6.8 Orbit Computation and Automorphism Pruning

7. Public API

Tier 1: SMSD facade (convenience, auto-standardises)

Tier 2: SMSD facade (pre-standardised, skip preprocessing)

Tier 3: SearchEngine (zero overhead, full control)

Tier 4: MolGraph.Builder (CDK-free, toolkit-agnostic)

MolGraph Canonical Operations

8. SMARTS Support

9. Performance

10. Test Suite

11. References