1 change: 1 addition & 0 deletions doc/format-spec.md
@@ -125,6 +125,7 @@ The following commands are available:
<dd>Match the extended regular expression 'str', which must be of
string type. Matching is performed greedily and '.' (a dot)
matches a newline, use <tt>[^\n]</tt> to work around this.
More details about the supported regex syntax can be found [here](regex-spec.md).
Optionally assign the matched string to variable 'name'. Note that
since some string characters have to be escaped already, you might
need to double escape them in a regex string. For example, to
63 changes: 63 additions & 0 deletions doc/regex-spec.md
@@ -0,0 +1,63 @@
Checktestdata Regex specification
=================================

A **reg**ular **ex**pression (or regex) can be used to match strings.
Formally, it describes a set of strings and a string is matched if it is contained in the set.

Regular expressions can contain both literal and special characters.
Most literal characters, like `A`, `a`, or `0`, are the simplest regular expressions and they simply match themselves.
Additionally, more complex regular expressions can be expressed by concatenating simpler regular expressions.
If *A* and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *x* matches *A* and another string *y* matches *B*, then the string *xy* matches *AB*.

Besides the literal characters, there are also the following special characters: `'('`, `')'`, `'{'`, `'}'`, `'['`, `']'`, `'*'`, `'+'`, `'?'`, `'|'`, `'\'`, `'^'`, `'.'`, `'-'`.
Their meaning is as follows:

* `.`: this matches any character, including newlines. If you need to match anything except the newline character, use `[^\n]` instead.
* `[]`: indicates a set of characters.
  Inside a set definition:
  * Literal characters can be listed and each of them will be matched, i.e., `[abc]` will match `'a'`, `'b'`, or `'c'`, but not `'abc'`.
  * Ranges can be specified with `-`, for example `[a-z]` will match any lowercase ASCII letter and `[0-9]` will match any digit.
    If `-` is escaped (e.g. `[a\-z]`), if the character preceding it belongs to another range (e.g. `[a-a-z]`), or if it is the first or last character (e.g. `[-a]` or `[a-]`), it will match a literal `'-'`.
    It is an error if the first character of the range has a higher code point than the last (e.g. `[z-a]`).
  * The complement of a character set is formed if the first character of the set is `^`.
    For example, `[^a]` will match anything except `'a'`.
    If `^` is escaped (e.g. `[\^]`) or if it is not the first character (e.g. `[a^]`), it will match a literal `'^'`.
  * `\` can be used to escape a special character.
    However, most special characters do not need to be escaped.
    Only `'['` and `']'` must be escaped, and `'^'` or `'-'` might need to be escaped depending on their position.
    For example, both `[\-]` and `[-]` will match a literal `'-'`.
    If `\` is not followed by a special character, it matches a literal `'\'`.
  * It is an error if the character set does not specify any characters (e.g. `[]` or `[^]`).
* `{m,n}`: causes the resulting regular expression to match from `m` to `n` repetitions of the preceding regular expression.
Matching is done greedily, i.e., as many repetitions as possible are matched.
Omitting *m* specifies a lower bound of zero, and omitting *n* specifies an infinite upper bound.
It is an error if *m* is larger than *n*.
Both *m* and *n* must be unsigned integers without leading zeros.
It is an error if the preceding regular expression is empty or ends with another repetition (e.g. `a{1,2}{1,2}`). To repeat a repetition, wrap it in `()` (e.g. `(a{1,2}){1,2}`).
* `{m}`: is a shorthand for `{m,m}`.
It is an error to omit `m`.
* `*`: is a shorthand for `{0,}`.
* `+`: is a shorthand for `{1,}`.
* `?`: is a shorthand for `{0,1}`.
* `|`: can be used to form the union of two regular expressions.
If *A* and *B* are both regular expressions, then *A|B* is also a regular expression.
In general, if a string *x* matches *A* or it matches *B*, then it also matches *A|B*.
Matching is done in *leftmost-first* fashion, i.e., any match of *A* is preferred over all matches of *B*.
For example, the checktestdata command `REGEX("p|ps")` will only extract `p`, even if the input is `ps`.
* `(...)`: if *A* is a regular expression then *(A)* is also a regular expression.
* `\`: escapes the subsequent special character.
If `\` is not followed by a special character it will match a literal `\` (e.g. `\d` will match `'\d'`).
Note that checktestdata strings also use `\` to escape characters.
Therefore, `REGEX("\\*")` becomes the regular expression `\*` and matches a literal `'*'`, not a variable number of `\`.
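
The rules for `{m,n}` and `{m}` listed above can be sketched as a small standalone validator. This is only an illustration under stated assumptions, not the checktestdata implementation: `RepeatBounds`, `read_bound`, `parse_repeat_bounds` and `check_repeat_examples` are hypothetical names, and the nine-digit limit is an arbitrary choice to keep the sketch free of integer overflow.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type, not part of checktestdata.
struct RepeatBounds {
    long lower;                 // an omitted m defaults to 0
    std::optional<long> upper;  // empty means "no upper bound"
};

// Reads an unsigned integer from the front of s into value.
// Leaves value empty if there are no digits; returns false on a
// leading zero or on more digits than this sketch wants to handle.
static bool read_bound(std::string_view& s, std::optional<long>& value) {
    std::string digits;
    while (!s.empty() && s.front() >= '0' && s.front() <= '9') {
        digits += s.front();
        s.remove_prefix(1);
    }
    if (digits.size() > 1 && digits[0] == '0') return false;
    if (digits.size() > 9) return false;  // avoid overflow in std::stol
    if (!digits.empty()) value = std::stol(digits);
    return true;
}

// Parses "{m,n}", "{m,}", "{,n}", "{,}" or "{m}"; returns nullopt on error.
std::optional<RepeatBounds> parse_repeat_bounds(std::string_view s) {
    if (s.empty() || s.front() != '{') return std::nullopt;
    s.remove_prefix(1);
    std::optional<long> lower, upper;
    if (!read_bound(s, lower)) return std::nullopt;
    if (!s.empty() && s.front() == ',') {
        s.remove_prefix(1);
        if (!read_bound(s, upper)) return std::nullopt;
        if (lower && upper && *lower > *upper) return std::nullopt;  // m > n
    } else {
        if (!lower) return std::nullopt;  // "{}": m is mandatory in {m}
        upper = lower;                    // {m} is shorthand for {m,m}
    }
    if (s != "}") return std::nullopt;
    return RepeatBounds{lower.value_or(0), upper};
}

// Usage checks mirroring the rules in the list above.
void check_repeat_examples() {
    assert(parse_repeat_bounds("{,3}")->lower == 0);             // omitted m
    assert(!parse_repeat_bounds("{4,}")->upper.has_value());     // omitted n
    assert(!parse_repeat_bounds("{5,2}"));                       // m > n
    assert(!parse_repeat_bounds("{01}"));                        // leading zero
}
```

Returning `std::nullopt` instead of raising an error keeps the sketch easy to test; a real parser would more likely report which rule was violated.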

## Notes

The regular expression syntax and behaviour are carefully chosen to match a common subset of many modern regular expression definitions and implementations like Perl, Python, JavaScript, Ruby, PHP, Java, C++, Rust, Go, ...
Advanced features like non-greedy quantifiers, special group constructs (e.g. named groups), lookahead, lookbehind, etc. are not supported.
Shorthands like `\d` or `[:digit:]` are also not supported; use `[0-9]` instead.

> [!WARNING]
> Earlier versions of checktestdata used POSIX-like regular expressions with *leftmost-longest* matching and support for `[:digit:]`.
> This is no longer supported and matching is done *leftmost-first* instead.
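
The *leftmost-first* behaviour and the newline handling of `.` can be observed with plain `std::regex`. The following is a minimal sketch, assuming the default ECMAScript grammar of `std::regex`; `match_prefix` and `check_match_examples` are hypothetical helpers and not part of checktestdata. In that grammar `.` does not match a newline, so the sketch uses `[\s\S]` to express "any character including newline".

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: returns the prefix of input matched by pattern,
// or nullopt if the pattern does not match at the start of input.
std::optional<std::string> match_prefix(const std::string& pattern,
                                        const std::string& input) {
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::smatch m;
    // match_continuous anchors the match at the beginning of input.
    if (!std::regex_search(input, m, re,
                           std::regex_constants::match_continuous)) {
        return std::nullopt;
    }
    return m.str();
}

void check_match_examples() {
    // Leftmost-first: the first alternative that matches wins.
    assert(*match_prefix("p|ps", "ps") == "p");
    // Reordering the alternatives extracts the longer string.
    assert(*match_prefix("ps|p", "ps") == "ps");
    // A plain '.' in the ECMAScript grammar does not match a newline ...
    assert(!match_prefix("a.b", "a\nb"));
    // ... whereas [\s\S] matches any character, including newline.
    assert(*match_prefix("a[\\s\\S]b", "a\nb") == "a\nb");
}
```

When the longest alternative should be extracted, listing it first (as in `ps|p`) gives the same result that leftmost-longest matching would have produced for this input.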
210 changes: 199 additions & 11 deletions libchecktestdata.cc
@@ -41,8 +41,8 @@ class doesnt_match_exception {};
class eof_found_exception {};
class generate_exception {};

const int display_before_error = 65;
const int display_after_error = 50;
constexpr int display_before_error = 65;
constexpr int display_after_error = 50;

size_t prognr;
const command *currcmd;
@@ -63,8 +63,8 @@ vector<command> program;
// This stores array-type variables like x[i,j] as string "x" and
// vector of the indices. Plain variables are stored using an index
// vector of zero length.
typedef map<vector<bigint>,value_t> indexmap;
typedef map<value_t,set<vector<bigint>>> valuemap;
using indexmap = map<vector<bigint>,value_t>;
using valuemap = map<value_t,set<vector<bigint>>>;
map<string,indexmap> variable, preset;
map<string,valuemap> rev_variable, rev_preset;

@@ -214,6 +214,190 @@ long string2int(const string &s)
return res;
}

// Cache for compiled regular expressions
map<string, regex> regex_cache;

// Restrict/adjust c++ standard regex behaviour:
// '.' matches everything, including newline
// '[...]' character set (non empty). '^' form the complement of the charset
// '{m,n}' repeat m to n times (m and n are optional)
// '{m}' repeat exactly m times (m is mandatory)
// '*', '+', '?' shorthand repeat notation
// '|' union of two regex
// '(...)' parenthesis
// '\' escape special characters
class RegexParser {
static constexpr int STATE_EMPTY = 1;
static constexpr int STATE_NONEMPTY = 2;
static constexpr int STATE_REPEAT = 3;

static constexpr char ANY_CHAR = '\0';
static constexpr string_view SPECIAL = "(){}[]*+?|\\^.-";
static constexpr string_view UNSAFE = "(){}[]*+?|\\^.-&#$~";
static constexpr string_view CHARSET_UNSAFE = "[]|\\^-&~";

static bool is_special(char c) {
return SPECIAL.find(&c, 0, 1) != string_view::npos;
}

static bool is_charset_unsafe(char c) {
return CHARSET_UNSAFE.find(&c, 0, 1) != string_view::npos;
}

static bool is_unsafe(char c) {
return UNSAFE.find(&c, 0, 1) != string_view::npos;
}

string raw;
string_view todo;
string out;

string pop() {
size_t len = todo.size() >= 2 && todo[0] == '\\' && is_special(todo[1]) ? 2 : 1;
string token = string(todo.substr(0, len));
todo.remove_prefix(token.size());
return token;
}

void consume(char expected = ANY_CHAR, bool literal = false) {
string token = pop();
if ( expected!=ANY_CHAR && !token.empty() && token[0]!=expected ) {
error("invalid regex: unexpected char");
}
assert(!token.empty());
if ( literal && token.size()==1 && is_unsafe(token[0]) ) out += '\\';
if ( !literal && token=="." ) token = "[\\s\\S]";
out += token;
}

void parse_charset() {
consume('[');
if ( !todo.empty() && todo[0]=='^' ) consume();
vector<string> tmp;
auto flush_tmp = [&](){
for ( string& token : tmp ) {
if ( token.size()==1 && is_charset_unsafe(token[0]) ) out += '\\';
out += token;
}
tmp.clear();
};
bool empty = true;
while ( !todo.empty() && todo[0]!=']' ) {
if ( todo[0]=='[' ) {
error("invalid regex: nested charset?");
}
tmp.push_back(pop());
empty = false;
if ( tmp.size() >= 3 && tmp[tmp.size()-2]=="-" ) {
string lhs = tmp[tmp.size()-3];
string rhs = tmp[tmp.size()-1];
if ( lhs.back()>rhs.back() ) {
error("invalid regex: invalid character range");
}
tmp.pop_back();
tmp.pop_back();
flush_tmp();
out += '-';
tmp.push_back(rhs);
flush_tmp();
}
}
flush_tmp();
if ( empty ) error("empty character set");
consume(']');
}

int parse_non_negative_int() {
string digits = "";
while ( !todo.empty() && todo[0]>='0' && todo[0]<='9' ) {
out += todo[0];
digits += todo[0];
todo.remove_prefix(1);
}
if ( digits.size()>1 && digits[0]=='0' ) {
error("invalid regex: range bound has leading zeros");
}
return digits.empty() ? -1 : string2int(digits);
}

void parse_repeat() {
consume('{');
int lower = parse_non_negative_int();
if ( !todo.empty() && todo[0]==',' ) {
if ( lower < 0 ) out += '0';
consume();
int upper = parse_non_negative_int();
if ( lower>=0 && upper>=0 && lower>upper ) {
error("invalid regex: invalid range");
}
} else if ( lower<0 ) {
error("invalid regex: missing range length");
}
consume('}');
}

void parse() {
int state = STATE_EMPTY;
auto transition = [&](int next) {
if ( next==STATE_REPEAT ) {
if ( state==STATE_EMPTY ) {
error("invalid regex: nothing to repeat");
} else if ( state==STATE_REPEAT ) {
error("invalid regex: multiple repeats");
}
}
state = next;
};
while ( !todo.empty() && todo[0]!=')' ) {
switch ( todo[0] ) {
case '[':
transition(STATE_NONEMPTY);
parse_charset();
break;
case '(':
transition(STATE_NONEMPTY);
consume();
parse();
consume(')');
break;
case '{':
transition(STATE_REPEAT);
parse_repeat();
break;
case '*':
case '+':
case '?':
transition(STATE_REPEAT);
consume();
break;
case '|':
transition(STATE_EMPTY);
consume();
break;
case '.':
transition(STATE_NONEMPTY);
consume();
break;
default:
transition(STATE_NONEMPTY);
consume(ANY_CHAR, true);
}
}
}

public:
RegexParser(const string& raw_) : raw(raw_), todo(raw) {}

regex compile() {
if ( !todo.empty() ) parse();
if ( !todo.empty() ) {
assert(todo[0]==')');
error("invalid regex: unmatched parenthesis");
}
return regex(out, regex::optimize | regex::nosubs);
}
};

// forward declarations
value_t eval(const expr&);
bigint eval_as_int(const expr& e);
@@ -322,11 +506,11 @@ value_t value(const expr& x)

template<class A, class B>
struct arith_result {
typedef typename conditional<
using type = typename conditional<
is_same<A,bigint>::value && is_same<B,bigint>::value,
bigint,
mpf_class
>::type type;
>::type;
};

template<class A, class B> struct arith_compatible {
@@ -352,7 +536,7 @@ struct arithmetic_##name : public boost::static_visitor<value_t> {\
}\
template<class A, class B, typename enable_if<arith_compatible<A,B>::value,int>::type = 0,\
class C = typename arith_result<A,B>::type>\
value_t operator()(const A& a, const B& b)const {\
value_t operator()(const A& a, const B& b)const {\
return value_t(C(a op b));\
}\
};\
@@ -364,12 +548,12 @@ value_t operator op(const value_t &x, const value_t &y) \
#define DECL_VALUE_CMPOP(op,name) \
struct arithmetic_##name : public boost::static_visitor<bool> {\
template<class A, class B, typename enable_if<!is_comparable<A,B>::value,int>::type = 0>\
bool operator()(const A& a, const B& b)const {\
bool operator()(const A& a, const B& b)const {\
cerr << "cannot compute " << a << " " #op " " << b << endl; \
exit(exit_failure);\
}\
template<class A, class B, typename enable_if<is_comparable<A,B>::value,int>::type = 0>\
bool operator()(const A& a, const B& b)const {\
bool operator()(const A& a, const B& b)const {\
return a op b;\
}\
};\
@@ -952,7 +1136,7 @@ void gentoken(command cmd, ostream &datastream)

else if ( cmd.name()=="REGEX" ) {
string regexstr = eval(cmd.args[0]).getstr();
// regex e1(regex, regex::extended); // this is only to check the expression
// RegexParser(regexstr).compile(); // this is only to check the expression
string str = genregex(regexstr);
datastream << str;
if ( cmd.nargs()>=2 ) setvar(cmd.args[1],value_t(str));
@@ -1114,7 +1298,11 @@ void checktoken(const command& cmd)

else if ( cmd.name()=="REGEX" ) {
string str = eval(cmd.args[0]).getstr();
regex regexstr(str,regex::extended|regex::nosubs|regex::optimize);
auto cache_it = regex_cache.find(str);
if ( cache_it == regex_cache.end() ) {
cache_it = regex_cache.emplace(str, RegexParser(str).compile()).first;
}
regex regexstr = cache_it->second;
smatch res;
string matchstr;

10 changes: 5 additions & 5 deletions libchecktestdata.hpp
@@ -9,13 +9,13 @@

namespace checktestdata {

const int exit_failure = 2;
constexpr int exit_failure = 2;

const int opt_whitespace_ok = 1; // ignore additional whitespace
const int opt_quiet = 2; // quiet execution: only return status
const int opt_debugging = 4; // print additional debugging statements
constexpr int opt_whitespace_ok = 1; // ignore additional whitespace
constexpr int opt_quiet = 2; // quiet execution: only return status
constexpr int opt_debugging = 4; // print additional debugging statements

const int float_precision = 15; // output precision (digits) of floats
constexpr int float_precision = 15; // output precision (digits) of floats

void init_checktestdata(std::istream &progstream, int opt_mask = 0, long seed = -1);
/* Initialize libchecktestdata by loading syntax from progstream and
2 changes: 1 addition & 1 deletion tests/test_23_prog.in
@@ -1,3 +1,3 @@
SET(foo="bar.*")
STRING(foo) NEWLINE
REGEX(foo) # Note that '.' also matches newlines and ERE is greedy.
REGEX(foo) # Note that '.' also matches newlines and is greedy.
1 change: 1 addition & 0 deletions tests/test_regex1_data.in
@@ -0,0 +1 @@
a afJ7bayb
12 changes: 12 additions & 0 deletions tests/test_regex1_prog.in
@@ -0,0 +1,12 @@
# IGNORE GENERATE TESTING
REGEX("a a") # contains space
REGEX("[e-h]") # character class
REGEX("[I-M]") # character class
REGEX("[5-8]") # character class
REGEX("[^a]") # character class
STRING("a")
REGEX("x?") # optional
REGEX("y?") # optional
REGEX("z?") # optional
STRING("b")
REGEX(".+") # any including newline
2 changes: 2 additions & 0 deletions tests/test_regex2_data.err1
@@ -0,0 +1,2 @@
1
"[]"
2 changes: 2 additions & 0 deletions tests/test_regex2_data.err10
@@ -0,0 +1,2 @@
1
"*" *
2 changes: 2 additions & 0 deletions tests/test_regex2_data.err11
@@ -0,0 +1,2 @@
1
"+" +
2 changes: 2 additions & 0 deletions tests/test_regex2_data.err12
@@ -0,0 +1,2 @@
1
"?" ?
2 changes: 2 additions & 0 deletions tests/test_regex2_data.err13
@@ -0,0 +1,2 @@
1
"(" (