Parsek is a lightweight, pure-Python parser library packaged as a single source file with zero dependencies. It bridges the gap between regular expressions and heavyweight parser generators. Parsek is ideally suited for domain-specific languages (DSLs and mini DSLs), custom data formats, and plugin systems. Grammar rules are defined directly in Python using composable parsing expressions. Parsek supports both combinator-style and finite state machine (FSM) parsers, and both styles can be mixed seamlessly.
- Minimal footprint: single file, zero deps (optional minified build)
- Declarative, concise grammar in plain Python: no separate grammar file
- Seamlessly blend combinators with FSMs
- Lookahead, lookbehind, backtracking, branching (`if_`/`elif_`/`else_`/`endif`)
- Versatile accumulation/emission at every step
- Flexible error handling and reporting
- Optional tracing for step-by-step debugging
- Extensively tested
- From PyPI:

  ```sh
  pip install parsek
  ```

- Or simply copy `src/parsek.py` into your project.
The parser leverages functional programming techniques, allowing you to easily define complex recursive grammars directly with Python expressions. For example, this complete JSON parser uses just a few lines of code:
```python
from parsek import Parser, Val

def parse_json(s: str):
    p = Parser(s)  # create parser instance with input string

    @p.subroutine  # JSON value
    def j_val(p: Parser, out, **_kwargs):
        return (p.ws.  # skip whitespace
                if_.one({'null': None, 'true': True, 'false': False}, out).
                elif_.one(p.decimal, out).
                elif_.one(p.string, out, quotes={'"': '"'}).  # JSON strings are double-quoted only
                elif_.one(p.collection, j_val, l := []).do(p.accumulate, out, l).  # array [...]
                else_.one(p.collection, j_kv, d := {}, brackets={'{': '}'}).do(p.accumulate, out, d).  # object {...}
                endif)

    @p.subroutine  # key:value pair in JSON object
    def j_kv(p: Parser, out, **_kwargs):
        return p.one(p.string, k := p.Val(), quotes={'"': '"'}).ws.one(':').ws.one(j_val, (out, k))

    r = Val()  # resulting JSON goes here
    p.ws.one(j_val, r, nomatch='Invalid JSON').ws.one(p.EOF, nomatch='Unexpected trailing input')
    return r.value
```

For this example and more (CSV, mini DSL, INI/config), see the Examples section below.
Matching input is done with matchers: literals, predicates, subroutines, or combinations thereof. Matchers are passed to quantifiers, which perform the actual parsing and accumulation. The simplest quantifier is `one(matcher, ...)`, which matches exactly one occurrence of the matcher at the current input position. On success, the parser advances by the match length (consuming input) and returns success (`self`). On failure, the position remains unchanged (for simple matchers) or is advanced to the failure point (for complex ones like subroutines), returning failure (falsy). All other quantifiers (`zero_or_more`, `one_or_more`, etc.) are built on top of `one()` and accept the same arguments.
```python
from parsek import Parser, In, Not

p = Parser("hello, World!")  # instantiate parser with input string
# match 'hello', one or more non-alpha/non-space chars, optional whitespace,
# 'world' (ignoring case), and '!'
p.one('hello').one_or_more(Not(str.isalpha, str.isspace)).ws.one('world', ic=True).one(In('!?.'))
assert p.one(p.EOF)  # ensure we consumed all input
```

The core feature of Parsek is support for chain expressions. All quantifiers and flow-control operations return the current `Parser` instance (on success) or a special opaque object (derived from `Parser.Branch` or `Parser.Stop`) that directs further execution of the chain. This allows chaining multiple quantifiers and flow-control operations in a single expression. In the example above, if `one('hello')` fails, the entire expression short-circuits and returns `Parser.Fail` (a falsy object with `is_ok=False`). Similarly, if any subsequent quantifier fails, the whole chain fails. With flow-control elements like `if_`, `else_`, `endif`, `fail`, and side effects like `do`, chain expressions enable compact and expressive encoding of parsing logic.
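To make the short-circuit behavior concrete, here is a small sketch based on the semantics described above (a failed chain returns the falsy `Parser.Fail`, and a failed simple-literal match leaves the position unchanged):

```python
from parsek import Parser

p = Parser("goodbye")
r = p.one('hello').one(',')    # one('hello') fails; the rest of the chain is skipped
assert not r and not r.is_ok   # the chain returned the falsy Parser.Fail
assert p.one('goodbye')        # position was unchanged by the failed literal match
```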
A matcher is anything accepted by `p.one(...)` (or any other quantifier) as the first positional argument. Supported kinds:

- Literals
  - `'a'`, `'hello'` - string literals
  - `Parser.EOF` or `Parser.END_CHAR` - the implicit end-of-input sentinel (also a string)
- Character sets and predicates
  - Unary predicate on a single char: `str.isdigit`, `str.isalpha`, custom `lambda ch: ...`
    - for case-insensitive matching (`ic=True`), the passed char is already lower-cased.
  - Non-unary predicate on a single char: `lambda ch, arg1, ...: ...`
    - additional args are passed by the quantifier, e.g. `p.one(lambda ch, s: ch in s, 'abc')`
    - for case-insensitive matching (`ic=True`), the predicate receives the `ic=True` kwarg, and a lower-cased `ch`.
  - `In('abc')` tests membership: `ch in 'abc'`
  - `p.chars('^_a-zA-Z0-9$')` character-class matcher supporting ranges, negation (`^`), escapes (`\`), and `$` (the `END_CHAR` sentinel)
- Mapping
  - Dict key match (literal or `Val`) that emits a mapped value into accumulators: `{'true': True, 'false': False}`
- Subroutines
  - Functions decorated with `@parser_subroutine` (aliases: `@Parser.subroutine`, `@Parser.sr`) - may be recursive (as grammar rules often are). Subroutines receive the current `Parser` instance as the first arg, followed by any positional and keyword args passed by the quantifier. They must return truthy (success) or falsy (failure). Typically, if you need to match more than a character/literal as a subpattern, you use a subroutine.
  - New-stack variant: `@parser_subroutine_new_stack` for FSM-style routines that should not mutate the caller's state. Creates a new parser instance for the subroutine; only the position is merged back on return.
  - The case-insensitive kwarg (`ic=True`), if present, is simply passed down to the subroutine (input is not modified).
- Alternation
  - Tuple of matchers: `('a', 'b', str.isdigit)` - the first that matches wins. Any of the above matcher kinds can be used inside the tuple.
  - Tuple/list/set of literals or `Val` objects: `('a', 'b', 'hello')` - the first that matches wins.
- Case-insensitive matching (ignore case, IC)
  - Any matcher above with the `ic=True` argument in any quantifier: `p.one('world', ic=True)`
  - Or with the quantifier name suffix `_ic`/`i`, e.g., `p.one_ic('world')` (see Quantifiers). This is just a shorthand for `ic=True` and results in the same behavior.
  - For subroutine matchers and non-unary predicates, `ic=True` is also passed down to the matcher.
- Negation
  - `Not(matcher1, matcher2, ...)` negates any matcher or alternation tuple of matchers. If any matcher matches, `Not` fails; if all matchers fail, `Not` matches (consuming a single char). `Not` is treated specially by the parser for efficiency and correctness.
Examples:

```python
from parsek import Parser, Val, In

p = Parser(input_string)
p.one('a')              # match 'a'
p.one(str.islower)      # match any lowercase letter
p.one(In('XYZ'))        # match any of 'X', 'Y', or 'Z'
p.one(('X', 'Y', 'Z'))  # same as above, alternation tuple
p.one(p.chars('XYZ'))   # same as above, char set matcher
p.one(p.chars('X-Z'))   # same as above, char class matcher
p.one({'X': 1, 'Y': 2, 'Z': 3}, v := Val())  # match and emit mapped value into v
p.one(lambda ch: ch in 'XYZ')        # match any of 'X', 'Y', or 'Z' using a custom predicate
p.one(lambda ch, s: ch in s, 'XYZ')  # same as above, non-unary predicate with extra arg

@p.subroutine  # subroutine matcher
def token(p):  # identifier type token: [a-zA-Z_][a-zA-Z0-9_]*
    return p.one((str.isalpha, '_')).zero_or_more(lambda ch: ch.isalnum() or ch == '_')

p.one(token)  # match an identifier token as defined above

# Case-insensitive matchers:
p.one('a', ic=True)             # match 'a' or 'A'
p.one_ic('a')                   # same as above, shorthand with _ic suffix
p.one(p.chars('a-z'), ic=True)  # match any letter a-z or A-Z

# Negation:
p.one(Not(str.isalpha))            # match any non-alpha char
p.one(Not(' ', '\t', '\r', '\n'))  # match any non-whitespace char
```

Quantifiers control repetition and optionality. All accept the same matcher kinds and arguments as `one(...)`; in fact, all other quantifiers are implemented on top of `one(...)`. All quantifiers can also accumulate matched text or emitted values into various kinds of sinks (see Accumulation & Emission below). Supported quantifiers:
- Core
  - `one(m, ...)` - exactly one
  - `zero_or_one(f, ...)` - optional one
  - `one_or_more(f, ...)` - at least one (greedy)
  - `zero_or_more(f, ...)` - optional, multiple (greedy)
  - `repeat(min, max, f, ...)` - at least `min`, at most `max` (both inclusive)
    - and its aliases: `exactly(n, f, ...)`, `at_least(n, f, ...)`, `at_most(n, f, ...)`
- Case-insensitive: add the `ic=True` argument to any quantifier, or use the `_ic` suffix on the quantifier name, e.g. `one_ic`, `repeat_ic`, etc.
  - the `_ic` suffix is just a shorthand and results in the underlying quantifier being called with `ic=True`.
- The `nomatch='message'` kwarg on any quantifier raises (or invokes your handler) when no match occurs.
- Short aliases
  - `x1 == one`
  - `x1_ == one_or_more`
  - `x0_ == zero_or_more`
  - `x0_1 == zero_or_one`
  - `xM_N == repeat(M, N, ...)` (dynamic, generated on demand)
- Dynamic quantifiers (generated on demand)
  - Exact: `x2(f, ...)` - exactly 2
  - Ranges: `x2_5(f, ...)` - 2 to 5
  - "Or more": `x2_(f, ...)` - 2 or more
  - Case-insensitive: suffix `i`, e.g. `x2i`, `x2_5i`, `x2_i`
- Spelled-out variants (generated on demand)
  - `two(f, ...)`, `three(f, ...)`, …
  - `two_to_five(f, ...)` or `two_to_5(f, ...)`
  - `four_or_more(f, ...)`
  - Case-insensitive: `_ic` suffix, e.g. `two_ic`, `two_to_five_ic`
All quantifiers share the same call signature as `one(...)`:

```python
def one(self, matcher, *args, acc=None, ic=False, nomatch=None, **kwargs)
```

➰ Note: `acc`, `ic`, and `nomatch` are keyword-only arguments and don't actually appear in the formal signature of `one(...)` or any other quantifier - they're shown here for clarity.

- `matcher` - any matcher as described above
- `*args`, `**kwargs` - forwarded to the underlying `one()` call (repeatedly, by all quantifiers other than `one`):
  - For non-unary predicates or subroutines, these are passed directly to the matcher.
    - For subroutines, the current `Parser` instance is automatically passed as the first argument (don't add it yourself).
  - For other matchers, these args/kwargs act as accumulator targets.
- `acc` - explicit accumulator (sink) for the entire quantifier call. Unlike positional accumulator args, it's used once per quantifier call. Can be one or more accumulators. See Accumulation & Emission for details.
- `ic` - if True, enables case-insensitive matching for any matcher that supports it. For subroutines and non-unary predicates, `ic=True` is also passed down to the matcher. Avoid writing `ic=False` (the default) as it just clutters call sites and must then be added to your subroutine signatures.
- `nomatch` - if given (string or callable), raises a `ValueError` (or calls your error handler) when no match occurs.
Examples:

```python
from parsek import Parser, Val

p = Parser("aa123bbbXYZ")

# exactly two 'a'
p.exactly(2, 'a', lambda v: print(f"Got {v}"))  # lambda is an accumulator; called twice
# -> prints "Got a" twice

# exactly three digits
p.three(str.isdigit, acc=lambda v: print(f"Got {v}"))  # lambda is an accumulator; called once
# -> prints "Got 123" once

# up to 3 'b' into a list
bl = []
p.at_most(3, 'b', bl)  # bl is an accumulator
assert bl == ['b', 'b', 'b']

# 3 uppercase letters using a dynamic quantifier
up = Val()
p.x3(str.isupper, up)  # up will accumulate the 3 uppercase letters
assert up.value == 'XYZ'
```

Note that you can simply use Python's `while` loop for any repetition you like. This is a valid and useful technique:
p = Parser("aaaaab")
while p.one('a'):
print('a')While example above is obviously contrived, as zero_or_more('a') would not only be shorter but also faster (while can be just as fast), there are cases where using while is perfectly fine like in the INI/config parser example.
Accumulation/emission of parsed text can be accomplished in various ways:

- All quantifiers can collect matched text into "sinks" as described in Quantifiers:
  - Pass sinks either as `acc=...` or as extra positional args (when the matcher does not consume those args).
  - Supported sinks:
    - `Val` - flexible (variant/union) scalar accumulator (str/int/float/bool); has many convenience features that mesh well with the Parser API paradigm.
    - `list`, `set` - via `append`/`add`
    - Mapping sink: `(dict, key, [converter_or_combiner])`
    - Callable sink: `sink(value)`
    - Multiple targets of the above: `Acc(a, b, ...)` or a tuple whose first element is a `Val`: `(val_a, list_b, ...)`. `Acc` exists mostly to disambiguate from a mapping tuple sink.
    - `Parser.accumulate(out, v)` describes the full contract, and can be used as a convenient emitter in chain expressions (with `do()`).
- `do()`/`do_if()` - at any point in the parsing tree, call any callable with `do(func, *args, **kwargs)` (or the predicated `do_if`) to invoke a custom side-effect function with any args/kwargs you want: `do(print, "Matched", value)`, `do(l.append, value)`, or `do(lambda: my_emit(value))`, etc.
- `p.save_pos('tag')` and `copy('tag', sink)` - save the current position with `save_pos` at any point, and later copy the input slice from that position to the current position into any sink with `copy('tag', sink)`. This is rarely needed but can be useful in some cases.
Examples:

```python
from parsek import Parser, Val, In

p = Parser("hello abc 123 xyz 456 world")
p.save_pos('hi')  # save start position

# Val as string accumulator
a = Val()
p.one_or_more(str.isalpha, a).ws
assert a == 'hello' and a.value == 'hello'

# Mapping sink -> modifies dictionary
d = {}
p.one_or_more_ic(str.isalpha, acc=(d, 'word')).ws
# NOTE: we are using acc= here ⤴︎ to call the mapping sink only once for the whole match
assert d == {'word': 'abc'}

# list as accumulator
n = []
p.one_or_more(str.isdigit, n).ws
assert n == ['1', '2', '3']

# lambda as accumulator
p.one_or_more(In('xyz'), lambda v: print(f"Got {v}")).ws
# -> prints "Got x", "Got y", "Got z"

# accumulating with acc= vs. positional args
l = []     # l will be passed to the underlying one() calls, so it will get called 3 times
s = Val()  # acc=s will be called once by the at_most for the whole match
p.at_most(3, str.isdigit, l, acc=s).ws
assert l == ['4', '5', '6']
assert s == '456'

# do() to call any function at any point
p.one('world').do(print, "Matched world").do(l.append, 'world')
assert l == ['4', '5', '6', 'world']

# use copy to capture the input slice from our saved 'hi' position to the current position
all = Val()
p.copy('hi', all)
assert all == 'hello abc 123 xyz 456 world'
```

There are a few ways to handle parsing errors:
- Use the `nomatch='message'` argument on any quantifier to raise a `ValueError` when no match occurs. The raised error includes the message, the current position, and a snippet of the input around the error position.
- Use `nomatch=callable` to call your custom error handler instead of raising. The callable receives the `Parser` instance, from which you can get the current position, input snippet, etc.
- Call `err()` or `err_if()` anywhere in the parsing tree to raise an error (or call your handler) and halt the parser immediately.
- Finally, just use `do()` to call any function at any point, including raising your own exceptions.
p = Parser("42x")
r = Val()
# require at least one digit
p.one_or_more(str.isdigit, r, nomatch="Expected at least one digit")
# Custom error handler example:
class MyError:
def __init__(self, msg):
self.msg = msg
def __call__(self, p: Parser):
snippet = p.source[max(0, p.pos-5):p.pos+5]
raise ValueError(f"MyError: {self.msg} at pos {p.pos}: ...{snippet}...")
# require exact suffix 'kg' and less than 100kg
p.one('kg', nomatch=MyError("Expected 'kg'")).err_if(lambda: r.value >= 100, MyError("Weight too large"))Parsek supports speculative parsing with lookahead/commit/backtrack, and branching constructs. Think of them as parser equivalents of if / elif / else in Python code. There are also lookbehind (behind) and peek operations that do not consume input.
- `lookahead` (alias: `if_`): Push the current position onto the backtrack stack. Must be followed by `commit`, `backtrack`, or `alt`.
- `commit` (alias: `endif`): Pop the backtrack stack and commit to the current position.
- `backtrack()`: Pop the backtrack stack and restore the previous position. Note that this is a method and not a property like the others. This is a reminder NOT to call it inside chain expressions (use `alt` instead).
- `alt` (alias: `else_`): Create an alternative branch in a chain expression. One `alt` is allowed per `lookahead`. Automatically commits or backtracks. You can have as many nested `lookahead`/`alt`/`merge` inside an `alt` branch as needed.
- `merge` (alias: `endif`): Merge an `alt` branch back into the main flow.
- `elif_`: its own construct that allows multiple branches in an `if_` chain expression in lieu of nesting `lookahead`/`alt`/`merge`.
- `fail`, `fail_if` - fail the current branch early; skips to the next `else_`/`elif_`/`alt` or `endif`/`merge`. Don't confuse with `err`, which raises an error and halts the parser; `fail` is like a silent backtrack.
- `break_`, `continue_`, `back`, `back_ok` - these stop the whole chain expression immediately, either committing (`break_`, `continue_`) or backtracking (`back`, `back_ok`), and returning success (`continue_`, `back_ok`) or failure (`break_`, `back`). While invaluable, use these sparingly as they can make the parsing logic harder to follow.
- `end` - immediately puts the parser in the end state, stopping all further parsing; returns success. `Parser.is_active` will be False after this and `Parser.is_end_state` will be True.
- `err`, `err_if` - raise an error (or call your error handler) and halt the parser immediately. Don't confuse with `fail`/`fail_if`, which simply fail the current branch and backtrack silently.
- `behind(matcher, ...)` - lookbehind: match the given matcher just before the current position without consuming input. The scan is backwards from the current position. Fails if the matcher fails. Arguments are the same as for `one()`.
- `peek(matcher, ...)` - peek ahead: match the given matcher at the current position without consuming input. The scan is forwards from the current position. Fails if the matcher fails. Arguments are the same as for `one()`.
- Every `lookahead` must be matched by `commit`/`backtrack` or `alt`/`merge`.
- In chain expressions use `if_`/`elif_`/`else_`/`endif`, especially for multiple branches, as `elif_` has no direct equivalent in `lookahead`/`alt`/`commit` and makes multiple branches much cleaner and easier to read.
- `if_`/`else_`/`endif` are literal aliases for `lookahead`/`alt`/`merge` and can be used interchangeably.
- Use `fail`/`fail_if` to backtrack silently; `err` or `err_if` to halt with an error.
A simple lookahead with consume-or-backtrack. Note how tedious it is; we will see how chain expressions streamline this greatly:

```python
p = Parser("abc")
if p.lookahead.one('a').ws.one('b'):  # match a and b separated by optional whitespace
    p.commit()     # commit: keep the consumed input
    print("Found a & b")
else:
    p.backtrack()  # restore position
    print("No a & b")
```

Now the same with chain expressions:
```python
p.lookahead.one('a').ws.one('b').do(print, "Found a & b").alt.do(print, "No a & b").merge
# same but using the if_/else_ aliases:
p.if_.one('a').ws.one('b').do(print, "Found a & b").else_.do(print, "No a & b").endif
```

Note that `if_`/`else_` not only reads better but also allows multiple branches with `elif_` without nesting:
```python
( p.if_.one('a').ws.one('b').do(print, "Found a & b").
    elif_.one('b').do(print, "Found just b").
    elif_.one('c').do(print, "Found c").
    else_.do(print, "No a, b, or c").
    endif )

# same as using nested lookahead/alt/merge:
( p.lookahead.one('a').ws.one('b').do(print, "Found a & b").
    alt.lookahead.one('b').do(print, "Found just b").
    alt.lookahead.one('c').do(print, "Found c").
    alt.do(print, "No a, b, or c").
    merge.   # matching all the merges with
    merge.   # the corresponding lookaheads
    merge )  # gets tedious quickly!
```

Of course each branch can have as many quantifiers and flow-control elements as needed, not just a single `one(...)` as in the examples above.
Since `if_ ... endif` is a lookahead, it never fails as a whole even if it backtracks (that's the entire point of a lookahead). However, the `else_` branch is not a lookahead, and when it fails, the entire if-elif-else chain fails. Note that `elif_`, just like `if_`, is a lookahead, so it behaves the same: it doesn't fail, it just backtracks. You can explicitly fail any branch by calling `fail`.
Consider this example:
p = Parser("hello planet")
ok = p.one('hello').ws.one('world').is_ok
assert not ok and p.pos == 6 # match failed and parser advanced to after 'hello '
p.pos = 0 # reset Parser's position
ok = p.if_.one('hello').ws.one('world').endif.is_ok
assert ok and p.pos == 0 # match OK and parser backtracked to start
ok = p.if_.one('hello').ws.one('world').else_.fail.endif.is_ok
assert not ok and p.pos == 0 # match FAILED and parser backtracked to start
# Now with a successful match:
p = Parser("hello world")
ok = p.one('hello').ws.one('world').is_ok
assert ok and p.pos == 11 # match OK and parser advanced to after 'world' (END_CHAR)
p.pos = 0 # reset Parser's position
ok = p.if_.one('hello').ws.one('world').endif.is_ok
assert ok and p.pos == 11 # match OK and parser parser advanced to after 'world' (END_CHAR)
p.pos = 0 # reset Parser's position
# fail inside if_ branch forces backtrack even though the branch would have succeeded:
ok = p.if_.one('hello').ws.one('world').fail.endif.is_ok
assert ok and p.pos == 0 # match OK and parser backtracked to start
# this could be useful if you want to peek ahead without consuming input.
# Note that there's Parser.peek() for that too but it requires a subroutine:
ok = p.peek(p.sr(lambda p: p.one('hello').ws.one('world'))).is_ok
assert ok and p.pos == 0 # match OK and parser backtracked to start
# if the peek's subroutine fails:
ok = p.peek(p.sr(lambda p: p.one('hello').ws.one('planet'))).is_ok
assert not ok and p.pos == 0 # match FAILED and parser backtracked to start
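For completeness, a small sketch of the non-consuming `behind`/`peek` operations described above:

```python
from parsek import Parser

p = Parser("abc")
p.one('a').one('b')
assert p.behind('b')  # 'b' sits just before the current position; nothing is consumed
assert p.peek('c')    # 'c' sits at the current position; nothing is consumed
assert p.pos == 2     # position is unchanged by behind/peek
```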
⚠️ Note: the source of many infinite-loop problems is forgetting to handle the end-of-input sentinel: `END_CHAR`/`EOF`. See here: handling end-of-input.
⚠️ Note: All arguments in a chain expression are always evaluated, even for the failed portions of the chain. While in most cases this is not an issue, it can be if an argument is only valid in certain branches, which can result in subtle bugs. In such cases, either use reference types (like `Val`, `list`, etc.) or defer evaluation with a callable (like an inline lambda).
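A small illustrative sketch of this pitfall (hypothetical code; `num` is only filled in when the `if_` branch matches):

```python
from parsek import Parser, Val

p = Parser("x")
num = Val()
# BAD: int(num.value) is evaluated while the chain expression is being built,
# even though the if_ branch fails on "x" and num is never set:
#   p.if_.one(str.isdigit, num).do(print, int(num.value)).else_.one('x').endif
# OK: defer the conversion with a lambda so it only runs if do() is actually reached:
p.if_.one(str.isdigit, num).do(lambda: print(int(num.value))).else_.one('x').endif
```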
```python
Parser.set_trace(level=3, color=True, out=print)
```

- Levels: 0–5 (0=off, 5=most verbose)
- Colorized call-sites, inputs, and decisions
- Low overhead when disabled

Limit all looping quantifiers with `Parser.PARSE_LIMIT` to catch runaway loops faster:

```python
Parser.PARSE_LIMIT = 100  # default is 100000
```

Tracing is off by default, but if you enabled it for debugging, turn it off when done. Tracing adds significant overhead: it has to walk the call stack with `inspect`, parse lambda source code, colorize messages, etc. It is heavy! Also, running with `-O` eliminates all debug assertions and removes the tracing code completely. The minified build does all that without needing `-O`.
This is obvious but worth mentioning: avoid placing long branches that are likely to fail ahead of shorter branches that are more likely to succeed.
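For instance, a hypothetical sketch: if most tokens in your input are single digits and quoted strings are rare, test the cheap, common branch first:

```python
from parsek import Parser, Val

p = Parser("7")
out = Val()
# Cheap, likely branch first:
p.if_.one(str.isdigit, out).else_.one(p.string, out).endif
# ...rather than attempting the longer, usually-failing branch first:
# p.if_.one(p.string, out).else_.one(str.isdigit, out).endif
```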
Instead of using `one(In('abc'))` or similar, cache the matcher in a local variable and use that:

```python
m = p.chars('^abc$')  # cache the matcher, even if chars() also caches internally
p.one(m)              # faster than p.one(p.chars('^abc$'))
```
one, consider using theacc=argument instead of passing the accumulator as a positional argument (if it is appropriate!). Usingacc=calls the accumulator only once per quantifier call instead of passing it to the underlyingonewhich will call it repeatedly on each match (note that sometimes you might want exactly that!). For example,
p = Parser("aaaaab")
s = Val()
p.one_or_more('a', s) # calls s.append('a') 5 times!
p.pos = 0 # reset position
p.one_or_more('a', acc=s) # calls s.append('aaaaa') only once!With
one, usingacc=or positional arg for accumulator makes no difference in such cases. Note, theacc=still useful when usingonewith a subroutine matcher, as all positional args and kwargs, exceptacc=are passed to the subroutine. So accumulator inacc=will get the entire input matched by the subroutine regardless of what the subroutine does/accumulates internally:
p = Parser("3.14")
# parse a decimal number into a float and also capture the string representation
p.one(p.decimal, pi := Val(), acc=(pi_str := Val()))
assert pi.value == 3.14 and pi_str.value == '3.14'
`END_CHAR`/`EOF` is Parsek's implicit end-of-input sentinel. It simplifies parsing logic by unifying termination handling, but it can also cause infinite loops if ignored. Once the parser reaches the end of input, every subsequent read returns `END_CHAR`/`EOF`, indefinitely. You must therefore handle it explicitly, e.g.:

```python
p.if_.one(p.EOF).end.endif
```

This puts the parser in the end state, causing all subsequent quantifiers to return immediately with `Parser.End` (a truthy value). A common pitfall involves an overly inclusive `Not(...)` without `END_CHAR` in its test set; it will keep matching at the end of input and never fail, resulting in an infinite loop. If appropriate, include `END_CHAR` in the `Not(...)` test set to ensure it stops at the end of input:
```python
m = Not(In('abc' + p.END_CHAR))  # matcher: fails on 'a', 'b', 'c', or END_CHAR
# or equivalently:
m = p.chars('^abc$')             # where '$' represents END_CHAR
```

During initial development you can ignore EOF but set `Parser.PARSE_LIMIT` to a small value (e.g. `20`) to catch potential infinite loops quickly. Once the parser behaves correctly, add proper EOF handling and remove the limit.
Every time you use a quantifier, an optimized version of the matcher harness is selected. Looping quantifiers like `one_or_more` first get the optimized harness with `get_one_ctx()` and then use it repeatedly with `one_with_ctx()`. You can do the same. For example:

```python
p = Parser("aaaaab")
ctx = p.get_one_ctx('a')    # pre-resolve matcher
while p.one_with_ctx(ctx):  # slightly faster than calling one('a') repeatedly
    pass
```
`get_one_ctx()` resolves the type of matcher (whether it's a literal, unary predicate, subroutine, etc.), then whether it is negated with `Not(...)`, and finally whether it is case-insensitive (`ic=True`). It then returns a context object that includes an optimized harness function plus context data, such as whether the matcher takes extra args.

You can mark your custom unary matchers with the `@unary` decorator to minimize the overhead of checking the matcher type on each `one()` or `get_one_ctx()` call. This can also be used to disambiguate the matcher type. The `@unary` decorator simply sets the `__arity=1` attribute on the receiver object; `is_unary()` checks for that attribute first.
```python
from parsek import Parser, unary

@unary  # <- optional, speeds up matcher dispatch
def is_vowel(ch):
    return ch in 'aeiou'

p = Parser("aeiouxyz")
p.one_or_more(is_vowel, v := [])
assert v == ['a', 'e', 'i', 'o', 'u']
```

EBNF grammar
```
value  = object | array | string | number | "true" | "false" | "null" ;
object = "{" ws [ member ( ws "," ws member )* ] ws "}" ;
member = string ws ":" ws value ;
array  = "[" ws [ value ( ws "," ws value )* ] ws "]" ;
string = '"' chars '"' ;
number = '-'? int frac? exp? ;
int    = '0' | [1-9] [0-9]* ;
frac   = '.' [0-9]+ ;
exp    = ('e'|'E') ('+'|'-')? [0-9]+ ;
ws     = (space | tab | newline | carriage return)* ;
```

We use the built-in `decimal` and `string` subroutines to parse the corresponding literals. Note that `Parser.decimal` is more permissive than the EBNF `number` rule above. To handle arrays `[...]` and objects `{...}` we use `collection` with the appropriate item subroutines, which are recursive. We allow empty items and trailing commas in arrays and objects, which is not strictly valid JSON but often convenient. You can easily modify the code to disallow either empty items or trailing/consecutive commas with the `empty_item` argument to `collection`.
```python
from parsek import Parser, Val

def parse_json(s: str):
    p = Parser(s)  # create parser instance with input string

    @p.subroutine  # JSON value
    def j_val(p: Parser, out, **_kwargs):
        return (p.ws.  # skip whitespace
                if_.one({'null': None, 'true': True, 'false': False}, out).
                elif_.one(p.decimal, out).  # p.decimal is more permissive than the EBNF's number
                elif_.one(p.string, out, quotes={'"': '"'}).  # JSON strings are double-quoted only
                elif_.one(p.collection, j_val, l := []).do(p.accumulate, out, l).  # array [...]
                else_.one(p.collection, j_kv, d := {}, brackets={'{': '}'}).do(p.accumulate, out, d).  # object {...}
                endif)

    @p.subroutine  # key:value pair in JSON object
    def j_kv(p: Parser, out, **_kwargs):
        return p.one(p.string, k := p.Val(), quotes={'"': '"'}).ws.one(':').ws.one(j_val, (out, k))

    r = Val()  # resulting JSON goes here
    p.ws.one(j_val, r, nomatch='Invalid JSON').ws.one(p.EOF, nomatch='Unexpected trailing input')
    return r.value
```

This parser is similar to the one used in Parsek's own `Parser.chars()` method to parse character-class specifications like `^a-zA-Z0-9_$`. While the format is simple, it has some challenges: handling `\` escapes, the context-sensitive `^` (negation only at the start, literal otherwise), `$` (end-of-input sentinel at the end but literal anywhere else), and the `-` dash, which either denotes a range (only between two chars that are not themselves ends of a range) or is a literal.
```python
from parsek import Parser, Not, Val

def parse_char_spec(spec: str):
    p = Parser(spec)
    trailing_eof = Val(False)  # set to True if $ is present at end

    @p.sr  # escaped or plain char; esc flag will be True if escaped
    def char(p, out, esc=None):
        return p.if_.one('\\').one(Not(p.EOF), out, esc).else_.one(p.NOT_EOF, out).endif
        # NOTE: Not(p.EOF) ⤴︎ is common, so there's the p.NOT_EOF ⤴︎ const for it

    @p.sr  # range: 'a-z' -> ('a','z') appended to out
    def range_(p, left, out):
        return p.one('-').one(char, e := Val()).do(lambda: out.append((left, e.v)))

    @p.sr  # range, char, or trailing '$'
    def atom(p, out):
        return (p.one(char, left := Val(), esc := Val(False)).
                if_.one(range_, left.v, out).
                elif_.fail_if(esc or left != '$').one(p.EOF, lambda _: trailing_eof.set(True)).
                else_.do(out.append, left.v).endif)

    p.zero_or_one('^', neg := Val(False)).zero_or_more(atom, r := [])
    return r, neg.v, trailing_eof.v  # -> [('a','z'), 'A'], True, False
```

Here's an alternative implementation of the `atom` subroutine that avoids the extra lookahead branch by relying on the computed `p.ch` property, which returns the current character. It does the same thing as above, but with less parsing:
```python
@p.sr  # range, char, or trailing '$'
def atom(p, out):
    return (p.one(char, left := Val(), esc := Val(False)).
            if_.one(range_, left.v, out).
            else_.do(lambda: trailing_eof.set(True) if (p.ch == p.EOF and left == '$' and not esc) else
                             out.append(left.v)).
            endif)
```

EBNF grammar (click to expand)
```
csv          = record , { nl , record } , [ nl ] ;
record       = field , { "," , field } ;
field        = quoted_field | bare_field ;
quoted_field = '"' , { qchar | '""' } , '"' ;
bare_field   = { char - '"' - "," - nl } ;
qchar        = char - '"' ;
char         = ? any Unicode character except newline ? ;
nl           = "\r\n" | "\n" ;
```

In this CSV parser we don't use any of the built-ins (even though `collection` could be useful); it's pretty much a direct encoding of the EBNF grammar above, with each rule corresponding to a subroutine.
```python
from parsek import Parser, Val

def parse_csv(s: str):
    p = Parser(s)
    rows = []
    nl = ('\r\n', '\n')  # newline matcher

    @p.subroutine  # quoted field: '"' { qchar | '""' } '"'
    def quoted_field(p: Parser, out):
        @p.subroutine  # escaped doubled quote "" (=> "); stop at the first unescaped quote
        def qcontent(p: Parser, out):
            return p.if_.one('""').do(out.append, '"').else_.x1_(p.chars('^"$'), acc=out).endif
        return p.one('"').x0_(qcontent, txt := Val('')).one('"').do(out.append, txt.value)

    @p.subroutine  # any chars except quote, comma, or newline
    def bare_field(p: Parser, out):
        txt = Val()
        return p.x0_(p.chars('^",\r\n$'), acc=txt).do(out.append, txt.value)

    @p.subroutine  # field = quoted_field | bare_field
    def field(p: Parser, out):
        return p.if_.one(quoted_field, out).else_.one(bare_field, out).endif

    @p.subroutine  # record = field { "," field }
    def record(p: Parser):
        row = []
        return (p.one(field, row).
                x0_(p.sr(lambda p: p.one(',').one(field, row))).
                do_if(lambda: not (len(row) == 1 and row[0] is None), rows.append, row))

    # csv = record { nl record } [ nl ] EOF
    p.one(record, nomatch='Invalid CSV record') \
     .x0_(p.sr(lambda p: p.one(nl).one(record))) \
     .x0_(nl) \
     .one(p.END_CHAR, nomatch='Unexpected trailing input')
    return rows
```

EBNF grammar (click to expand)
```
ini           = { ws* ( section | key_value | comment | empty_line ) } ;
section       = ws* "[" ws* section_name ws* "]" ws* nl ;
section_name  = 1*( char - "[" - "]" - nl ) ;
key_value     = ws* key ws* "=" ws* value ws* comment? nl ;
key           = { char - "=" - "[" - "]" - "#" - ";" - nl } ;
value         = quoted_string | number | boolean | null | bare_value ;
quoted_string = Parser.string ;   (* supports both " " and ' ' quotes, with escapes *)
number        = Parser.decimal ;  (* permissive: allows leading 0, exp, and fractions per Parser.decimal *)
boolean       = "true" | "false" | "on" | "off" | "yes" | "no" ;  (* case-insensitive *)
null          = "null" | "none" ;  (* case-insensitive *)
bare_value    = { char - "#" - ";" - nl } ;  (* then trimmed *)
comment       = ws* ("#" | ";") { char - nl } nl ;
empty_line    = ws* nl ;
ws            = " " | "\t" ;
nl            = "\r\n" | "\n" ;
char          = ? any Unicode character ? ;
```

In this parser we use Python's `while` loop to drive the main parsing flow. As we parse each section and key-value pair, we emit the results into the `cfg` dictionary using the `_enter_section` and `_set_kv` functions, which are called with `do` at the appropriate points in the parsing tree. Note the use of `do_if` in `key_value`, where the whole parser branch executes only on some external condition. We introduced the quantifier aliases to make the parsing expressions more concise, arguably at the cost of some readability. Finally, we added more timely and specific error messages with `nomatch` and `err_if`.
```python
from parsek import Parser, In, Not, Val

def parse_ini(s: str):
    p = Parser(s)              # new parser with input string
    cfg: dict[str, dict] = {}  # result goes here
    current_sec = Val(None)    # current section name or None (DEFAULT)

    def _enter_section(name: str):  # emitter: [section]
        current_sec.set(name)       # update current section
        cfg.setdefault(name, {})    # ensure section exists

    def _set_kv(k: str, v):                    # emitter: key = value
        sec = current_sec.value or 'DEFAULT'   # use DEFAULT section if none specified
        cfg.setdefault(sec, {})                # ensure section exists
        cfg[sec][k] = v                        # set key-value in current section

    ws = In(' \t')       # whitespace matcher
    nl = ('\r\n', '\n')  # newline matcher

    @p.subroutine  # a comment or empty line(s)
    def comment_or_empty(p: Parser):
        return (p.zero_or_more(ws).  # optional leading spaces
                if_.one(('#', ';')).x0_(p.chars('^\n\r$')).zero_or_more(nl).
                elif_.one_or_more(nl).            # empty line(s) ?
                else_.one(p.END_CHAR).end.endif)  # if not end-of-input, return False

    @p.subroutine  # value of a key=value pair
    def value(p: Parser, out):
        return (p.if_.one(p.string, out).   # supports both " " and ' '
                elif_.one(p.decimal, out).  # number
                elif_.one({'true': True, 'false': False,  # bool literals
                           'on':   True, 'off':   False,
                           'yes':  True, 'no':    False,  # ic -> ignore case
                           'null': None, 'none':  None}, out, ic=True).
                # bare value until comment/EOL:
                else_.one_or_more(p.chars('^#;\n\r$'), acc=lambda v: p.accumulate(out, v.strip())).
                endif)

    # Let's use the aliases x1 (one), x1_ (one_or_more), x0_ (zero_or_more), etc., for shorter lines:
    @p.subroutine  # [section] header
    def section(p: Parser):
        name = Val('')
        return (p.x0_(ws).x1('[').
                x0_(ws).x1_(p.chars('^[]\n\r$'), acc=(name, lambda _: name.rstrip()), nomatch="Empty section name").
                x1(']', nomatch="Expected ']' after section name").x0_(ws).
                x1(nl, nomatch="Expected newline after section header").
                err_if(Not(name), "Empty section name").
                do(_enter_section, name.value))

    @p.subroutine  # key = value
    def key_value(p: Parser):
        key = Val('')
        return (p.x0_(ws).
                x1_(p.chars('^;#=[]\n\r$'), acc=(key, lambda _: key.rstrip())).
                do_if(key, lambda: (p.
                    x1('=', nomatch="Expected '='").err_if(Not(key), "Empty key name").x0_(ws).
                    if_.x1(value, val := Val()).
                        x0_(ws).x0_(p.chars('^\n\r$')).x0_(nl).
                        do(_set_kv, key.value, val.value).
                    else_.err("Invalid key-value pair").endif)))

    while p.is_active and (p.      # main loop:
          x0_(comment_or_empty).   # - skip any comments/empty lines
          x0_1(section).           # - optional section header
          x0_1(key_value)): pass   # - optional key=value pairs
    return cfg
```

- Parser (`p`): stateful cursor over input with sentinel `END_CHAR`/`EOF`
  - `p.source` input text
  - `p.pos` current position (int)
  - `p.ch` current char (str), or `p.END_CHAR`/`p.EOF` if at or past the end
  - `p.END_CHAR`/`p.EOF` end-of-input sentinel (str: `'\uFFFF'`)
  - `p.save_pos(tag)` - store the current position with a tag in the Parser's built-in position dictionary, which is separate from the backtrack stack.
  - `p.pop_pos(tag)` - remove and return the stored position for the given tag. Can also return an offset/trimmed slice from the given position.
  - `p.copy(tag, acc)` - copy input from the stored position with the given tag to the current position into the given accumulator.
  - `p.copy_from(pos, acc)` - copy input from the given position to the current position into the given accumulator.
  - `p.slice(n)` input substring (len=n) from the current pos (includes `END_CHAR` as needed)
  - `p.slice_from(start)` input substring from start to the current pos (includes `END_CHAR` as needed)
  - `p.slice_behind(n)` input substring (len=n) before the current pos (includes `END_CHAR` as needed)
- Matchers: match individual characters or sequences:
  - Literals: `'a'`, `'hello'`
  - Combination: `('a', 'b', str.isdigit)` - any of the matchers in the tuple
  - Mapping: `{'true': True, 'false': False}` - maps a matched literal to an output value passed to the accumulator
  - Unary predicate: `str.isalpha` and custom callables with any number of args
  - Negation: `Not(...)`, which negates any matcher (`Not` is treated specially by the parser)
  - Char sets: `In('abc')`, `p.chars('a-zA-Z0-9')`
  - Subroutine: `@parser_subroutine` (aliases: `@p.subroutine`, `@p.sr`), can be recursive!
- Quantifiers: use the matchers above to do the actual parsing and accumulation:
  - `one(matcher, *args, ic=..., acc=..., nomatch=..., **kwargs)` - match exactly one occurrence of the matcher
    - `*args` and `**kwargs` are passed to the matcher if it is a subroutine matcher; otherwise they are the accumulator(s)
    - `ic=True` makes matching case-insensitive (or is passed to a subroutine matcher)
    - `acc=` accumulator(s)
  - `zero_or_one(str.isalpha)`, `one_or_more(...)`, `zero_or_more(...)` - multiple/optional occurrences
  - `repeat(m, n, str.isdigit)` and its aliases: `exactly(n, ...)`, `at_least(n, ...)`, `at_most(n, ...)`
  - Dynamic: `two(...)`, `two_to_five(...)`, `four_or_more(...)`, etc. - quantifiers generated on the fly and cached in the Parser class as regular methods
  - Short aliases: `x2(str.isdigit)`, `x2_5(...)`, `x4_(...)`, `x0_(...)`, etc.
  - `p.ws` matcher-quantifier (takes no arguments) that skips any `str.isspace` chars; similar (albeit more efficient) to the more verbose `zero_or_more(str.isspace)`
- Case-insensitive (ignoring case) matching:
  - Any quantifier with the `ic=True` argument
  - Or add the `_ic` (long forms) or `i` (short aliases) suffix to quantifier names: `one_ic`, `twelve_or_more_ic`, `x2_5i`, etc.
- Accumulation/emission, passed to quantifiers as plain positional args or `acc=...`:
  - Lists, dicts, sets, anything with an `append` or `add` method
  - Convenient scalar value accumulator: `Val` (str/int/float/bool)
  - Multiple sinks at once: `tuple(sink1, sink2, ...)` or `Acc(sink1, sink2, ...)`
  - Callable sinks: `sink(value)`
  - Mapping sinks: `(dict, key, [converter_or_combiner])`
  - `Parser.accumulate` generalizes all of the above
- Side effects and emission at any point in the parsing tree with `do` and `do_if`:
  - `do(print, 'matched!')` calls any callable with any args
  - `do_if(pred, lambda: print('hello'))` evaluates `pred` and calls any callable (with any args) if true
- Branching and flow:
  - `if_`/`elif_`/`else_`/`endif` (aliases for lookahead/alt/commit)
  - lookbehind (non-consuming): `behind('abc')`
  - peek ahead (non-consuming): `peek('abc')`
  - breaks: `break_`, `continue_`, `end`, `fail`, `err`, `back`, `back_ok`
  - conditional breaks: `check`, `fail_if`, `err_if`
- Built-in parser subroutines (accessed through the Parser class or instance):
  - `decimal` - int or float with exponent
  - `int_`, `uint` - signed/unsigned base-10 integer
  - `identifier` - XID start/continue, e.g. Python identifiers
  - `string` - quoted strings, with custom escapes/quotes and in-line replacements
  - `collection` - generic delimited lists, dicts, etc., with a generic item subroutine, custom separators, brackets/no brackets, and empty-item handling
- Helper classes:
  - `In`, `Not`, `Predicate` (`P`) - predicates/matchers
  - `Val`, `Acc` - scalar accumulator (string/int/float/bool) and multi-sinks
- Error handling
  - `err(msg)` raises a ValueError (or calls your handler) with input context at any point in the parsing tree
  - `err_if(pred, ...)` raises (or calls your handler) if `pred()` is true at that point
  - the `nomatch='...'` keyword argument in any quantifier raises (or calls your handler) on a failed match
  - the `on_err=...` keyword argument in built-in subroutines, e.g. `string`, `collection`
- FSM parsing
  - Use the Parser members `state`, `goto`, `skip_to`, `next`, and Python's `match`/`case`
  - Examples: the built-in `string` and `collection` subroutines are examples of FSM-style parsing
  - `@parser_subroutine_new_stack` forks another parser instance to isolate state to the subroutine. The position (but not any other state) is merged back on return. Useful for FSM subroutines to avoid modifying the main parser's state.
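To tie a few of these reference items together, here is a small sketch using only the cursor and accumulation members listed above:

```python
from parsek import Parser, Val

p = Parser("key=42")
p.save_pos('k')                    # remember where the key starts
p.one_or_more(str.isalpha)
p.copy('k', key := Val())          # copy the input slice 'key' into an accumulator
p.one('=').one_or_more(str.isdigit, num := Val())
assert key == 'key' and num == '42'
assert p.ch == p.END_CHAR          # the cursor now sits on the end-of-input sentinel
```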
A stripped (minified) build can be generated with `utils/minify.py`. The stripped version doesn't include comments, docstrings, or tracing/debugging support. See DEV.md for details.
See DEV.md for guidelines. See also CHANGELOG.md.
For its intended use (a `Parser` instance per parsing task), Parsek should be safe to use in a multi-threaded environment. Some precautions:
- A parser instance should not be shared between threads (unless you provide your own synchronization).
- Debugging/tracing settings are global to the Parser class and affect all threads; debug your parser in a single-threaded environment.
MIT License – see License.