feat: implement boolean search parser for dandiset queries #2631
bendichter wants to merge 3 commits into master
Add SearchParser class supporting advanced search syntax including:
- Boolean operators (AND, OR, NOT) for combining search terms
- Quoted phrases for exact matching
- Parentheses for grouping and precedence control
- Implicit AND between adjacent search terms

The parser constructs Django Q objects from search queries, replacing the previous simple word-splitting approach with a more powerful query language that enables complex filtering expressions.
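For readers following along, here is a minimal sketch of the tokenizing step such a parser needs before it can build Q objects. It is not the PR's `SearchParser`: the `tokenize` function, the token kinds, and the choice to recognize only uppercase operators are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern: a quoted phrase, a single parenthesis, or a
# run of characters that are not whitespace, parentheses, or quotes.
TOKEN_RE = re.compile(r'"([^"]*)"|([()])|([^\s()"]+)')

# Assumption: only uppercase words count as operators.
OPERATORS = {"AND", "OR", "NOT"}


def tokenize(query):
    """Split a search query into (kind, value) pairs."""
    tokens = []
    for phrase, paren, word in TOKEN_RE.findall(query):
        if paren:
            tokens.append(("PAREN", paren))
        elif word in OPERATORS:
            tokens.append(("OP", word))
        elif word:
            tokens.append(("TERM", word))
        else:
            # Only the quoted-phrase group matched (it may be empty).
            tokens.append(("PHRASE", phrase))
    return tokens


example_tokens = tokenize('hippocampus (mouse OR rat) NOT "two photon"')
```

Keeping phrases as their own token kind means a quoted `"AND"` is never mistaken for an operator.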
…kahead
- Update SearchParser to properly detect end of input by looking ahead for non-operator content before continuing to parse terms
- Fix handling of operator-only queries (e.g., 'AND OR NOT') to correctly return negation of empty Q object instead of empty Q
- Add position saving/restoration logic to check for meaningful content ahead without consuming tokens
- Update test expectations and documentation to reflect correct behavior when NOT operator has no following term

This resolves edge cases where the parser would incorrectly continue parsing when only operators or closing parentheses remained in the input.
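The save-and-restore lookahead described in this commit message could look roughly like the sketch below. The `TokenStream` class and its `has_content_ahead` method are hypothetical names, and tokens are simplified to plain strings; the PR's actual implementation may differ.

```python
class TokenStream:
    """Hypothetical token cursor, not the PR's SearchParser."""

    OPERATORS = {"AND", "OR", "NOT"}

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def has_content_ahead(self):
        """Report whether anything other than operators or closing
        parentheses remains, without consuming any tokens."""
        saved = self.pos  # remember the current position
        try:
            while self.pos < len(self.tokens):
                token = self.tokens[self.pos]
                if token not in self.OPERATORS and token != ")":
                    return True  # found a term, phrase, or "("
                self.pos += 1
            return False
        finally:
            self.pos = saved  # restore, so the scan has no side effects


operators_only = TokenStream(["AND", "OR", "NOT"])
with_term = TokenStream(["NOT", "mouse"])
```

Restoring the position in a `finally` block keeps the check side-effect free on every return path.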
@bendichter WDYT about doing more on this PR? At minimum, it should be accompanied by some "exposure" to the user that such boolean parsing exists. E.g., for https://registry.datalad.org/overview/ we have an explicit button to show the syntax. Adding some e2e test coverage to ensure that it all works would IMHO also be great!
I would be happy to include a blog post and an information button that outlines the capabilities, but I wanted to hold off on that until I got feedback about whether we want to include this functionality.
Well, I would like to get a better search capability. We have been talking about and working on that for a long time. This seems to be a sensible incremental improvement, not a radical change, so IMHO it would be great to see. Specifically on the PR, though: since you are trying to work out a parser, I think the best approach would be to give that job to a parser library! Ad-hoc code risks being brittle even if it "looks simple", and it would be hard or impossible to extend. Abstracting it away into parser syntax could help alleviate that. In datalad-registry, with @candleindark, we used https://github.com/lark-parser/lark and managed to translate queries into SQLAlchemy queries.
This PR would be accompanied by a new page in docs.dandiarchive.org, something like:
DANDI Archive Search Guide
The DANDI Archive search supports boolean operators and phrase matching to help you find exactly the datasets you need. This guide explains how to construct effective search queries.
Simple Searches
Just type what you're looking for. The search will find datasets that contain all of your terms:
This finds datasets that mention both "hippocampus" and "mouse" anywhere in their metadata.
Exact Phrases
Put quotes around phrases you want to match exactly:
This finds only datasets that contain this exact phrase, not datasets that just happen to mention "two", "photon", "calcium", and "imaging" separately.
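The difference between matching separate terms and matching an exact phrase can be illustrated with a small sketch. The helper names are hypothetical, and case-insensitive substring matching is an assumption, not something the guide specifies.

```python
def matches_all_terms(text, terms):
    """Simple-search semantics: every term must appear somewhere in the text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def matches_phrase(text, phrase):
    """Exact-phrase semantics: the words must appear together, in order."""
    return phrase.lower() in text.lower()


# Made-up metadata strings, used only for illustration.
scattered = "Calcium imaging, two animals, single photon"
exact = "Two photon calcium imaging of mouse cortex"

terms = ["two", "photon", "calcium", "imaging"]
scattered_matches_terms = matches_all_terms(scattered, terms)
scattered_matches_phrase = matches_phrase(scattered, "two photon calcium imaging")
exact_matches_phrase = matches_phrase(exact, "two photon calcium imaging")
```

The first string mentions all four words, so it satisfies the term search, but only the second contains the quoted phrase.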
Boolean Operators
OR - Find Either Term
Use `OR` when you want datasets that match any of several alternatives:
This finds datasets about mice, datasets about rats, or datasets about both.
This finds datasets studying any of these three brain regions.
NOT - Exclude Terms
Use `NOT` to exclude datasets containing certain terms:
This finds calcium imaging datasets but excludes any that mention anesthesia.
To exclude multiple terms, you can either chain them:
Or group them with `OR`:
Both approaches exclude datasets mentioning either "lesion" or "drug".
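The equivalence of the two forms is De Morgan's law, and it can be checked exhaustively with plain booleans. The query strings in the comments (`NOT lesion NOT drug` and `NOT (lesion OR drug)`) are assumed spellings for illustration.

```python
from itertools import product


def chained(has_lesion, has_drug):
    # NOT lesion NOT drug (implicit AND between the two negations)
    return (not has_lesion) and (not has_drug)


def grouped(has_lesion, has_drug):
    # NOT (lesion OR drug)
    return not (has_lesion or has_drug)


# Compare both forms over all four combinations of the two flags.
equivalent = all(
    chained(a, b) == grouped(a, b)
    for a, b in product([False, True], repeat=2)
)
```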
AND - Require All Terms
You rarely need to type `AND` because it's automatic. These two queries are identical:
Both find datasets that mention all three terms.
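One way to picture the implicit `AND` is as a normalization pass that inserts the operator wherever two operands sit side by side. The sketch below is an illustration only: `make_and_explicit` is a hypothetical helper, tokens are plain strings, and the example terms are invented.

```python
OPERATORS = {"AND", "OR", "NOT"}


def make_and_explicit(tokens):
    """Insert "AND" wherever one operand directly follows another."""
    result = []
    for token in tokens:
        if result:
            prev = result[-1]
            # The previous token closes an operand if it is a term or ")".
            prev_ends_operand = prev not in OPERATORS and prev != "("
            # The current token opens an operand if it is a term, "(", or NOT.
            starts_operand = token not in {"AND", "OR", ")"}
            if prev_ends_operand and starts_operand:
                result.append("AND")
        result.append(token)
    return result


implicit = make_and_explicit(["hippocampus", "mouse", "electrophysiology"])
explicit = ["hippocampus", "AND", "mouse", "AND", "electrophysiology"]
```

After normalization, the implicit and explicit spellings are the same token sequence, which is why the two queries behave identically.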
Grouping with Parentheses
Use parentheses to control how operators combine:
This finds hippocampus datasets from either mice or rats.
This finds calcium imaging datasets from hippocampus or cortex, excluding lesion studies.
You can nest parentheses for complex queries:
This finds hippocampus datasets from mice or rats, excluding those involving lesions or anesthesia.
Operator Precedence
When you don't use parentheses, operators are evaluated in this order:
1. `NOT` (highest priority)
2. `AND` (implicit between adjacent terms)
3. `OR` (lowest priority)

This means:
Is interpreted as:
To get what you probably meant, use parentheses:
Best practices:
- When mixing `OR` with other operators, use parentheses to make your intent clear.
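The precedence rules in this guide (`NOT`, then implicit `AND`, then `OR`) can be sketched as a small recursive-descent evaluator with one function per precedence level. This is a simplified illustration, not the PR's parser: it evaluates a query against a plain string instead of building Django Q objects, assumes case-insensitive substring matching, and raises `ValueError` on malformed input. The example queries are invented.

```python
import re

TOKEN_RE = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def evaluate(query, text):
    """Return True if `text` satisfies the boolean `query`."""
    tokens = TOKEN_RE.findall(query)
    text = text.lower()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():  # lowest precedence
        result = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # parse first so tokens are always consumed
            result = result or rhs
        return result

    def parse_and():  # explicit AND, or adjacency
        result = parse_not()
        while peek() not in (None, "OR", ")"):
            if peek() == "AND":
                advance()
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():  # highest precedence
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = peek()
        if token is None or token in ("AND", "OR", ")"):
            raise ValueError("expected a term, phrase, or '('")
        advance()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        # Quoted phrases keep their quotes as tokens; strip them to match.
        return token.strip('"').lower() in text

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


record = "mouse cortex recordings"
without_parens = evaluate("mouse OR rat hippocampus", record)
with_parens = evaluate("(mouse OR rat) hippocampus", record)
```

Without parentheses the query reads as `mouse OR (rat AND hippocampus)`, so a mouse-only record matches; with parentheses, `hippocampus` becomes mandatory and the same record is rejected.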