Gathering detailed insights and metrics for regexp-tree
npm install regexp-tree
400 Stars
620 Commits
45 Forks
12 Watching
1 Branches
23 Contributors
Updated on 20 Nov 2024
JavaScript (99.89%)
Shell (0.11%)
Cumulative downloads
Last day: 1,094,437 (-1.9% compared to previous day)
Last week: 5,834,621 (5.8% compared to previous week)
Last month: 23,120,893 (19.5% compared to previous month)
Last year: 220,483,199 (42.1% compared to previous year)
Regular expressions processor in JavaScript
TL;DR: RegExp Tree is a regular expressions processor, which includes parser, traversal, transformer, optimizer, and interpreter APIs.
You can get an overview of the tool in this article.
The parser can be installed as an npm module:
npm install -g regexp-tree
You can also try it online using AST Explorer.
Make sure npm test still passes (add new tests if needed).
The regexp-tree parser is implemented as an automatic LR parser using the Syntax tool. The parser module is generated from the regexp grammar, which is based on the regular expressions grammar used in ECMAScript.
For development from the GitHub repository, run the build command to generate the parser module and transpile the JS code:
git clone https://github.com/<your-github-account>/regexp-tree.git
cd regexp-tree
npm install
npm run build
NOTE: JS code transpilation is used to support older versions of Node. For a faster development cycle you can use the npm run watch command, which continuously transpiles JS code.
Note: the CLI is exposed as its own regexp-tree-cli module.
Check the options available from CLI:
regexp-tree-cli --help
Usage: regexp-tree-cli [options]
Options:
-e, --expression A regular expression to be parsed
-l, --loc Whether to capture AST node locations
-o, --optimize Applies optimizer on the passed expression
-c, --compat Applies compat-transpiler on the passed expression
-t, --table Print NFA/DFA transition tables (nfa/dfa/all)
To parse a regular expression, pass the -e option:
regexp-tree-cli -e '/a|b/i'
Which produces an AST node corresponding to this regular expression:
{
  type: 'RegExp',
  body: {
    type: 'Disjunction',
    left: {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    right: {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    }
  },
  flags: 'i',
}
NOTE: the format of a regexp is / Body / OptionalFlags.
The parser can also be used as a Node module:
const regexpTree = require('regexp-tree');

console.log(regexpTree.parse(/a|b/i)); // RegExp AST
Note that regexp-tree supports parsing regexes from strings and from actual RegExp objects (in general, from any object which can be coerced to a string). If some feature is not yet implemented in an actual JavaScript RegExp, the expression should be passed as a string:
// Pass an actual JS RegExp object.
regexpTree.parse(/a|b/i);

// Pass a string, since `s` flag may not be supported in older versions.
regexpTree.parse('/./s');
Also note that in string mode a backslash is written as two backslashes (\\), per JavaScript string escaping rules:
// As an actual regexp.
regexpTree.parse(/\n/);

// As a string.
regexpTree.parse('/\\n/');
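The doubled backslash comes from JavaScript string literals rather than from regexp-tree itself. The following is a minimal sketch using only built-in JavaScript (it makes no regexp-tree calls) to show that the string form carries the same two-character escape as the regexp literal:

```javascript
// Built-in JavaScript only; nothing here calls regexp-tree.
const fromLiteral = /\n/;
const asString = '/\\n/';

// Strip the surrounding slashes to get the regexp body.
const body = asString.slice(1, -1);

// The string literal '\\n' holds two characters: a backslash and 'n'.
console.log(body.length); // 2

// That is exactly the source text of the regexp literal.
console.log(body === fromLiteral.source); // true

// With a single backslash the string would hold a raw newline instead.
const wrong = '/\n/';
console.log(wrong.slice(1, -1).length); // 1
```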
For source code transformation tools it might also be useful to capture locations of the AST nodes. From the command line it's controlled via the -l option:
regexp-tree-cli -e '/ab/' -l
This attaches a loc object to each AST node:
{
  type: 'RegExp',
  body: {
    type: 'Alternative',
    expressions: [
      {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97,
        loc: {
          start: {
            line: 1,
            column: 1,
            offset: 1,
          },
          end: {
            line: 1,
            column: 2,
            offset: 2,
          },
        }
      },
      {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98,
        loc: {
          start: {
            line: 1,
            column: 2,
            offset: 2,
          },
          end: {
            line: 1,
            column: 3,
            offset: 3,
          },
        }
      }
    ],
    loc: {
      start: {
        line: 1,
        column: 1,
        offset: 1,
      },
      end: {
        line: 1,
        column: 3,
        offset: 3,
      },
    }
  },
  flags: '',
  loc: {
    start: {
      line: 1,
      column: 0,
      offset: 0,
    },
    end: {
      line: 1,
      column: 4,
      offset: 4,
    },
  }
}
From Node it's controlled via the setOptions method exposed on the parser:
const regexpTree = require('regexp-tree');

const parsed = regexpTree
  .parser
  .setOptions({captureLocations: true})
  .parse(/a|b/);
The setOptions method sets global options, which are preserved between calls. It is also possible to provide options for a single parse call, which is often preferable:
const regexpTree = require('regexp-tree');

const parsed = regexpTree.parse(/a|b/, {
  captureLocations: true,
});
The parser supports several options which can be set globally via the setOptions method on the parser, or by passing them with each parse method invocation.
Example:
const regexpTree = require('regexp-tree');

const parsed = regexpTree.parse(/a|b/, {
  allowGroupNameDuplicates: true,
});
The following options are supported:
- captureLocations: boolean -- whether to capture AST node locations (false by default)
- allowGroupNameDuplicates: boolean -- whether to skip the duplicates check of named capturing groups
Setting allowGroupNameDuplicates makes the following expression possible:
/
  # YYYY-MM-DD date format:

  (?<year> \d{4}) -
  (?<month> \d{2}) -
  (?<day> \d{2})

  |

  # DD.MM.YYYY date format

  (?<day> \d{2}) .
  (?<month> \d{2}) .
  (?<year> \d{4})

/x
The traverse module allows handling needed AST nodes using the visitor pattern. In Node the module is exposed as the regexpTree.traverse method. Handlers receive an instance of the NodePath class, which encapsulates the node itself, its parent node, the property, and the index (in case the node is part of a collection).
Visiting a node follows this algorithm:
- call the pre handler.
- recurse into the node's children.
- call the post handler.
For each node type of interest, you can provide either:
- a function (pre).
- an object with members pre and post.
You can also provide a * handler, which will be executed on every node.
Example:
const regexpTree = require('regexp-tree');

// Get AST.
const ast = regexpTree.parse('/[a-z]{1,}/');

// Traverse AST nodes.
regexpTree.traverse(ast, {

  // Visit every node before any type-specific handlers.
  '*': function({node}) {
    // ...
  },

  // Handle "Quantifier" node type.
  Quantifier({node}) {
    // ...
  },

  // Handle "Char" node type, before and after.
  Char: {
    pre({node}) {
      // ...
    },
    post({node}) {
      // ...
    }
  }

});

// Generate the regexp.
const re = regexpTree.generate(ast);

console.log(re); // '/[a-z]+/' (assuming the Quantifier handler rewrote {1,} to +)
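The visiting order described above can be sketched with a tiny stand-alone walker. This is a hypothetical illustration, not regexp-tree's implementation: the walk function is an invented name, and it passes plain nodes to handlers instead of NodePath instances.

```javascript
// Hypothetical walker (not the regexp-tree API): it only illustrates
// the "pre -> children -> post" order on a plain object tree.
function walk(node, handlers) {
  const handler = handlers[node.type];
  // A bare function is treated as the `pre` handler.
  const pre = typeof handler === 'function' ? handler : handler && handler.pre;
  const post = handler && typeof handler === 'object' ? handler.post : undefined;

  // The `*` handler runs before any type-specific handler.
  if (handlers['*']) handlers['*'](node);
  if (pre) pre(node);

  // Recurse into every property that holds a node or a list of nodes.
  for (const key of Object.keys(node)) {
    const value = node[key];
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      if (child && typeof child === 'object' && typeof child.type === 'string') {
        walk(child, handlers);
      }
    }
  }

  if (post) post(node);
}

// A hand-written, simplified AST for /ab/.
const ast = {
  type: 'Alternative',
  expressions: [
    {type: 'Char', value: 'a'},
    {type: 'Char', value: 'b'},
  ],
};

const order = [];
walk(ast, {
  Alternative: {
    pre: () => order.push('pre Alternative'),
    post: () => order.push('post Alternative'),
  },
  Char: node => order.push('Char ' + node.value),
});

console.log(order);
// [ 'pre Alternative', 'Char a', 'Char b', 'post Alternative' ]
```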
NOTE: you can play with transformation APIs, and write actual transforms for quick tests in AST Explorer. See this example.
While the traverse module provides a basic traversal API, which can be used for any kind of AST handling, the transform module focuses mainly on transformation of regular expressions.
It accepts a regular expression in different formats (a string, an actual RegExp object, or an AST), applies a set of transformations, and returns an instance of TransformResult. Handlers receive as a parameter the same NodePath object used in traverse.
Example:
const regexpTree = require('regexp-tree');

// Handle nodes.
const re = regexpTree.transform('/[a-z]{1,}/i', {

  /**
   * Handle "Quantifier" node type,
   * transforming `{1,}` quantifier to `+`.
   */
  Quantifier(path) {
    const {node} = path;

    // {1,} -> +
    if (
      node.kind === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      path.replace({
        type: 'Quantifier',
        kind: '+',
        greedy: node.greedy,
      });
    }
  },
});

console.log(re.toString()); // '/[a-z]+/i'
console.log(re.toRegExp()); // /[a-z]+/i
console.log(re.getAST()); // AST for /[a-z]+/i
A transformation plugin is a module which exports a transformation handler. We have seen above how we can pass a handler object directly to the regexpTree.transform method; here we extract it into a separate module, so it can be implemented and shared independently.
Example of a plugin:
// file: ./regexp-tree-a-to-b-transform.js

/**
 * This plugin replaces chars 'a' with chars 'b'.
 */
module.exports = {
  Char({node}) {
    if (node.kind === 'simple' && node.value === 'a') {
      node.value = 'b';
      node.symbol = 'b';
      node.codePoint = 98;
    }
  },
};
Once we have this plugin ready, we can require it and pass it to the transform function:
const regexpTree = require('regexp-tree');
const plugin = require('./regexp-tree-a-to-b-transform');

const re = regexpTree.transform(/(a|c)a+[a-z]/, plugin);

console.log(re.toRegExp()); // /(b|c)b+[b-z]/
NOTE: we can also pass a list of plugins to regexpTree.transform. In this case the plugins are applied in one pass, in order. Another approach is to run several sequential calls to transform, setting up a pipeline in which a transformed AST is passed on to the next plugin, and so on.
You can see other examples of transform plugins in the optimizer/transforms or in the compat-transpiler/transforms directories.
The generator module generates regular expressions from corresponding AST nodes. In Node the module is exposed as the regexpTree.generate method.
Example:
const regexpTree = require('regexp-tree');

const re = regexpTree.generate({
  type: 'RegExp',
  body: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  flags: 'i',
});

console.log(re); // '/a/i'
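To give an idea of what generating source from an AST involves, here is a minimal sketch of a recursive printer. It is not the regexp-tree generator: toSource is a hypothetical name, it handles only four node types, and the escaping rule for Char is a simplifying assumption.

```javascript
// Hypothetical mini-generator (not regexp-tree's generate):
// covers only RegExp, Alternative, Disjunction and Char nodes.
function toSource(node) {
  switch (node.type) {
    case 'RegExp':
      return '/' + toSource(node.body) + '/' + node.flags;
    case 'Alternative':
      // Concatenation: print children one after another.
      return node.expressions.map(toSource).join('');
    case 'Disjunction':
      return toSource(node.left) + '|' + toSource(node.right);
    case 'Char':
      // Assumption: an escaped simple char is printed with a backslash.
      return node.escaped ? '\\' + node.value : node.value;
    default:
      throw new Error('Unsupported node type: ' + node.type);
  }
}

const source = toSource({
  type: 'RegExp',
  body: {
    type: 'Disjunction',
    left: {type: 'Char', value: 'a'},
    right: {type: 'Char', value: 'b'},
  },
  flags: 'i',
});

console.log(source); // '/a|b/i'
```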
Optimizer transforms your regexp into an optimized version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.
NOTE: the Optimizer is implemented as a set of regexp-tree plugins.
Example:
const regexpTree = require('regexp-tree');

const originalRe = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/;

const optimizedRe = regexpTree
  .optimize(originalRe)
  .toRegExp();

console.log(optimizedRe); // /\w+e+/
From CLI the optimizer is available via the --optimize (-o) option:
regexp-tree-cli -e '/[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/' -o
Result:
Optimized: /\w+e+/
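As a quick sanity check, the original and the optimized expressions can be compared with the built-in RegExp engine. This sketch does not call regexp-tree; the optimized pattern is copied from the result above, and the sample strings are arbitrary.

```javascript
// Built-in RegExp only: compare the original pattern with the
// optimized one on a handful of arbitrary sample strings.
const original = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/;
const optimized = /\w+e+/;

const samples = ['abce', 'e', '_e', 'x9ee', 'abc', '', '-e'];

// Both patterns should agree on every sample.
const agree = samples.every(s => original.test(s) === optimized.test(s));

console.log(agree); // true
console.log(optimized.test('abce')); // true
console.log(optimized.test('e')); // false, needs a word char before 'e'
```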
See the optimizer README for more details.
The optimizer module is also available as an ESLint plugin, which can be installed at: eslint-plugin-optimize-regex.
The compat-transpiler module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don't yet implement the new syntax.
NOTE: the compat-transpiler is implemented as a set of regexp-tree plugins.
Example, the "dotAll" s flag:
/./s
Is translated into:
/[\0-\uFFFF]/
Similarly, named capturing groups:
/(?<value>a)\k<value>\1/
Becomes:
/(a)\1\1/
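On an engine that already supports the modern syntax, the equivalence of both pairs can be observed directly. This is a minimal sketch with built-in RegExp only, assuming the runtime supports the s flag and named groups:

```javascript
// Built-in RegExp only; assumes a runtime with `s` flag and named groups.
const plainDot = /./;
const modernDot = /./s;
const legacyDot = /[\0-\uFFFF]/;

console.log(plainDot.test('\n')); // false
console.log(modernDot.test('\n')); // true
console.log(legacyDot.test('\n')); // true

const named = /(?<value>a)\k<value>\1/;
const numbered = /(a)\1\1/;

console.log(named.test('aaa'), numbered.test('aaa')); // true true
console.log(named.test('aa'), numbered.test('aa')); // false false
```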
To use the API from Node:
const regexpTree = require('regexp-tree');

// Using new syntax.
const originalRe = '/(?<all>.)\\k<all>/s';

// For legacy engines.
const compatTranspiledRe = regexpTree
  .compatTranspile(originalRe)
  .toRegExp();

console.log(compatTranspiledRe); // /([\0-\uFFFF])\1/
From CLI the compat-transpiler is available via the --compat (-c) option:
regexp-tree-cli -e '/(?<all>.)\k<all>/s' -c
Result:
Compat: /([\0-\uFFFF])\1/
The compat-transpiler module is also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.
Note, the plugin also includes extended regexp features.
Some of the non-standard features are also supported by regexp-tree.
NOTE: "non-standard" means specifically the ECMAScript standard, since in other regexp engines, e.g. PCRE, Python, etc., these features are standard.
One such feature is the x flag, which enables the extended mode of regular expressions. In this mode most whitespace is ignored, and expressions can use #-comments.
Example:
/
  # A regular expression for date.

  (?<year>\d{4})- # year part of a date
  (?<month>\d{2})- # month part of a date
  (?<day>\d{2}) # day part of a date

/x
This is normally parsed by the regexp-tree parser, and compat-transpiler has full support for it; it's translated into:
/(\d{4})-(\d{2})-(\d{2})/
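Because the translated form only uses numbered groups, it runs on any JavaScript engine. A small sketch with the built-in RegExp (the sample date is arbitrary) shows how the three parts are read back by position instead of by name:

```javascript
// Built-in RegExp only: the translated pattern from above.
const translated = /(\d{4})-(\d{2})-(\d{2})/;

// Numbered groups are read positionally; index 0 is the whole match.
const [, year, month, day] = translated.exec('2017-04-14');

console.log(year, month, day); // 2017 04 14
```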
The regexp extensions are also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.
Note, the plugin also includes compat-transpiler features.
To create an actual RegExp JavaScript object, we can use the regexpTree.toRegExp method:
const regexpTree = require('regexp-tree');

const re = regexpTree.toRegExp('/[a-z]/i');

console.log(
  re.test('a'), // true
  re.test('Z'), // true
);
It is also possible to execute regular expressions using the exec API method, which supports new syntax and features, such as named capturing groups:
const regexpTree = require('regexp-tree');

const re = `/

  # A regular expression for date.

  (?<year>\\d{4})- # year part of a date
  (?<month>\\d{2})- # month part of a date
  (?<day>\\d{2}) # day part of a date

/x`;

const string = '2017-04-14';

const result = regexpTree.exec(re, string);

console.log(result.groups); // {year: '2017', month: '04', day: '14'}
NOTE: you can read more about implementation details of the interpreter in this series of articles.
In addition to executing regular expressions using the JavaScript built-in RegExp engine, RegExp Tree also implements its own interpreter based on a classic NFA/DFA finite automaton engine.
Currently it serves educational purposes: tracing the regexp matching process and the transitions between NFA/DFA states. It also allows building a state transition table, which can be used for custom implementations. In the API the module is exposed as the fa (finite-automaton) object.
Example:
const {fa} = require('regexp-tree');

const re = /ab|c*/;

console.log(fa.test(re, 'ab')); // true
console.log(fa.test(re, '')); // true
console.log(fa.test(re, 'c')); // true

// NFA, and its transition table.
const nfa = fa.toNFA(re);
console.log(nfa.getTransitionTable());

// DFA, and its transition table.
const dfa = fa.toDFA(re);
console.log(dfa.getTransitionTable());
For more granular work with NFA and DFA, the fa module also exposes convenient builders, so you can build NFA fragments directly:
const {fa} = require('regexp-tree');

const {
  alt,
  char,
  or,
  rep,
} = fa.builders;

// ab|c*
const re = or(
  alt(char('a'), char('b')),
  rep(char('c'))
);

console.log(re.matches('ab')); // true
console.log(re.matches('')); // true
console.log(re.matches('c')); // true

// Build DFA from NFA
const {DFA} = fa;

const reDFA = new DFA(re);

console.log(reDFA.matches('ab')); // true
console.log(reDFA.matches('')); // true
console.log(reDFA.matches('c')); // true
The --table option allows displaying NFA/DFA transition tables. RegExp Tree also applies DFA minimization (using the N-equivalence algorithm), and produces the minimal transition table as its final result.
In the example below for the /a|b|c/ regexp, we first obtain the NFA transition table, which is further converted to the original DFA transition table (down from the 10 non-deterministic states to 4 deterministic states), and eventually minimized to the final DFA table (from 4 to only 2 states).
./bin/regexp-tree-cli -e '/a|b|c/' --table all
Result:
> - starting
✓ - accepting
NFA transition table:
┌─────┬───┬───┬────┬─────────────┐
│ │ a │ b │ c │ ε* │
├─────┼───┼───┼────┼─────────────┤
│ 1 > │ │ │ │ {1,2,3,7,9} │
├─────┼───┼───┼────┼─────────────┤
│ 2 │ │ │ │ {2,3,7} │
├─────┼───┼───┼────┼─────────────┤
│ 3 │ 4 │ │ │ 3 │
├─────┼───┼───┼────┼─────────────┤
│ 4 │ │ │ │ {4,5,6} │
├─────┼───┼───┼────┼─────────────┤
│ 5 │ │ │ │ {5,6} │
├─────┼───┼───┼────┼─────────────┤
│ 6 ✓ │ │ │ │ 6 │
├─────┼───┼───┼────┼─────────────┤
│ 7 │ │ 8 │ │ 7 │
├─────┼───┼───┼────┼─────────────┤
│ 8 │ │ │ │ {8,5,6} │
├─────┼───┼───┼────┼─────────────┤
│ 9 │ │ │ 10 │ 9 │
├─────┼───┼───┼────┼─────────────┤
│ 10 │ │ │ │ {10,6} │
└─────┴───┴───┴────┴─────────────┘
DFA: Original transition table:
┌─────┬───┬───┬───┐
│ │ a │ b │ c │
├─────┼───┼───┼───┤
│ 1 > │ 4 │ 3 │ 2 │
├─────┼───┼───┼───┤
│ 2 ✓ │ │ │ │
├─────┼───┼───┼───┤
│ 3 ✓ │ │ │ │
├─────┼───┼───┼───┤
│ 4 ✓ │ │ │ │
└─────┴───┴───┴───┘
DFA: Minimized transition table:
┌─────┬───┬───┬───┐
│ │ a │ b │ c │
├─────┼───┼───┼───┤
│ 1 > │ 2 │ 2 │ 2 │
├─────┼───┼───┼───┤
│ 2 ✓ │ │ │ │
└─────┴───┴───┴───┘
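A transition table like the minimized one above is enough to drive a matcher by hand. The following is a minimal sketch of such a "custom implementation"; the object layout and the matchesDfa name are assumptions made for illustration, not the format returned by getTransitionTable.

```javascript
// Minimized DFA for /a|b|c/, copied by hand from the table above.
// The object layout is an assumption, not regexp-tree's table format.
const table = {
  1: {a: 2, b: 2, c: 2},
  2: {},
};
const startState = 1;
const acceptingStates = new Set([2]);

// Hypothetical helper: runs the DFA over the whole input string.
function matchesDfa(input) {
  let state = startState;
  for (const ch of input) {
    state = table[state][ch];
    // No transition on this symbol: the input is rejected.
    if (state === undefined) return false;
  }
  return acceptingStates.has(state);
}

console.log(matchesDfa('a')); // true
console.log(matchesDfa('c')); // true
console.log(matchesDfa('')); // false
console.log(matchesDfa('ab')); // false
```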
Below are the AST node types for different regular expressions patterns:
A basic building block, single character. Can be escaped, and be of different kinds.
Basic non-escaped char in a regexp:
z
Node:
{
  type: 'Char',
  value: 'z',
  symbol: 'z',
  kind: 'simple',
  codePoint: 122
}
NOTE: to test this from CLI, the char should be in an actual regexp -- /z/.
\z
The same value, with the escaped flag added:
{
  type: 'Char',
  value: 'z',
  symbol: 'z',
  kind: 'simple',
  codePoint: 122,
  escaped: true
}
Escaping is mostly used with meta symbols:
// Syntax error
*
// OK
\*
Node:
{
  type: 'Char',
  value: '*',
  symbol: '*',
  kind: 'simple',
  codePoint: 42,
  escaped: true
}
A meta character should not be confused with an escaped char.
Example:
\n
Node:
{
  type: 'Char',
  value: '\\n',
  symbol: '\n',
  kind: 'meta',
  codePoint: 10
}
Other meta characters include: ., \f, \r, \n, \t, \v, \0, [\b] (backspace char), \s, \S, \w, \W, \d, \D.
NOTE: Meta characters representing ranges (like ., \s, etc.) have an undefined value for symbol and NaN for codePoint.
NOTE: \b and \B are parsed as the Assertion node type, not Char.
A char preceded with \c, e.g. \cx, which stands for CTRL+x:
\cx
Node:
{
  type: 'Char',
  value: '\\cx',
  symbol: undefined,
  kind: 'control',
  codePoint: NaN
}
A char preceded with \x, followed by a HEX-code, e.g. \x3B (symbol ;):
\x3B
Node:
{
  type: 'Char',
  value: '\\x3B',
  symbol: ';',
  kind: 'hex',
  codePoint: 59
}
Char-code:
\42
Node:
{
  type: 'Char',
  value: '\\42',
  symbol: '*',
  kind: 'decimal',
  codePoint: 42
}
Char-code started with \0, followed by an octal number:
\073
Node:
{
  type: 'Char',
  value: '\\073',
  symbol: ';',
  kind: 'oct',
  codePoint: 59
}
Unicode char started with \u, followed by a hex number:
\u003B
Node:
{
  type: 'Char',
  value: '\\u003B',
  symbol: ';',
  kind: 'unicode',
  codePoint: 59
}
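The symbol and codePoint values in the nodes above can be cross-checked with plain JavaScript. This sketch uses only built-in functions and does not involve regexp-tree:

```javascript
// Built-in JavaScript only: the same character reached from
// hex, octal and unicode notations.
const fromHex = String.fromCodePoint(0x3b); // \x3B
const fromOctal = String.fromCodePoint(parseInt('073', 8)); // \073
const fromUnicode = '\u003B'; // \u003B
const fromDecimal = String.fromCodePoint(42); // \42

console.log(fromHex, fromOctal, fromUnicode); // ; ; ;
console.log(fromUnicode.codePointAt(0)); // 59
console.log(fromDecimal); // *

// The built-in engine agrees on the hex and unicode escapes.
console.log(/\x3B/.test(';'), /\u003B/.test(';')); // true true
```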
When using the u flag, unicode chars can also be represented using \u followed by a hex number between curly braces:
\u{1F680}
Node:
{
  type: 'Char',
  value: '\\u{1F680}',
  symbol: '🚀',
  kind: 'unicode',
  codePoint: 128640
}
When using the u flag, unicode chars can also be represented using a surrogate pair:
\ud83d\ude80
Node:
{
  type: 'Char',
  value: '\\ud83d\\ude80',
  symbol: '🚀',
  kind: 'unicode',
  codePoint: 128640,
  isSurrogatePair: true
}
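The relation between the surrogate pair and the single code point 128640 can be verified with built-in JavaScript. The arithmetic below is the standard UTF-16 decoding formula; nothing here calls regexp-tree.

```javascript
// Built-in JavaScript only: a surrogate pair and its code point.
const rocket = '\ud83d\ude80';

console.log(rocket === '\u{1F680}'); // true
console.log(rocket.length); // 2 (UTF-16 code units)
console.log(rocket.codePointAt(0)); // 128640

// Standard UTF-16 decoding of a high/low surrogate pair.
const high = 0xd83d;
const low = 0xde80;
const codePoint = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
console.log(codePoint); // 128640

// Only with the `u` flag is the pair treated as one character.
console.log(/^.$/u.test(rocket)); // true
console.log(/^.$/.test(rocket)); // false
```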
Character classes define a set of characters. A set may include simple characters as well as character ranges. A class can be positive (any of the characters in the class matches), or negative (any character except those in the class matches).
A positive character class is defined between [ and ] brackets:
[a*]
A node:
{
  type: 'CharacterClass',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: '*',
      symbol: '*',
      kind: 'simple',
      codePoint: 42
    }
  ]
}
NOTE: some meta symbols are treated as normal characters in a character class. E.g. * is not a repetition quantifier, but a simple char.
A negative character class is defined between [^ and ] brackets:
[^ab]
The AST node is the same, with the negative property added:
{
  type: 'CharacterClass',
  negative: true,
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    }
  ]
}
As mentioned, a character class may also contain ranges of symbols:
[a-z]
A node:
{
  type: 'CharacterClass',
  expressions: [
    {
      type: 'ClassRange',
      from: {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97
      },
      to: {
        type: 'Char',
        value: 'z',
        symbol: 'z',
        kind: 'simple',
        codePoint: 122
      }
    }
  ]
}
NOTE: it is a syntax error if the to value is less than the from value: /[z-a]/.
The range value can be the same for from and to, and the special range - character is treated as a simple character when it stands in a char position:
// from: 'a', to: 'a'
[a-a]
// from: '-', to: '-'
[---]
// simple '-' char:
[-]
// 3 ranges:
[a-zA-Z0-9]+
Unicode property escapes are a new type of escape sequence available in regular expressions that have the u flag set. With this feature it is possible to write Unicode expressions as:
const greekSymbolRe = /\p{Script=Greek}/u;

greekSymbolRe.test('π'); // true
The AST node for this expression is:
{
  type: 'UnicodeProperty',
  name: 'Script',
  value: 'Greek',
  negative: false,
  shorthand: false,
  binary: false,
  canonicalName: 'Script',
  canonicalValue: 'Greek'
}
All possible property names, values, and their aliases can be found at the specification.
For General_Category it is possible to use a shorthand:
/\p{Letter}/u; // Shorthand

/\p{General_Category=Letter}/u; // Full notation
Binary names use the single value as well:
/\p{ASCII_Hex_Digit}/u; // Same as: /[0-9A-Fa-f]/
The capitalized P defines the negation of the expression:
/\P{ASCII_Hex_Digit}/u; // NOT an ASCII Hex digit
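These forms can be tried directly with the built-in engine. This is a minimal sketch assuming a runtime that supports Unicode property escapes; the sample characters are arbitrary.

```javascript
// Built-in RegExp only; assumes Unicode property escapes are supported.
const greek = /\p{Script=Greek}/u;
const letterShort = /\p{Letter}/u;
const letterFull = /\p{General_Category=Letter}/u;
const hexDigits = /^\p{ASCII_Hex_Digit}+$/u;
const notHexDigit = /\P{ASCII_Hex_Digit}/u;

console.log(greek.test('π')); // true
console.log(greek.test('p')); // false

// Shorthand and full notation agree.
console.log(letterShort.test('π'), letterFull.test('π')); // true true
console.log(letterShort.test('1'), letterFull.test('1')); // false false

console.log(hexDigits.test('1aF')); // true
console.log(notHexDigit.test('g')); // true
console.log(notHexDigit.test('f')); // false
```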
An alternative (or concatenation) defines a chain of patterns followed one after another:
abc
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    },
    {
      type: 'Char',
      value: 'c',
      symbol: 'c',
      kind: 'simple',
      codePoint: 99
    }
  ]
}
Other examples:
// 'a' with a quantifier, followed by 'b'
a?b
// A group followed by a class:
(ab)[a-z]
The disjunction defines the "OR" operation for regexp patterns. It's a binary operation, having left and right nodes.
Matches a or b:
a|b
A node:
{
  type: 'Disjunction',
  left: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  right: {
    type: 'Char',
    value: 'b',
    symbol: 'b',
    kind: 'simple',
    codePoint: 98
  }
}
Groups play two roles: they define grouping precedence, and they allow capturing needed sub-expressions in the case of a capturing group.
"Capturing" means the matched string can be referred to later by a user, including in the pattern itself, by using backreferences.
Chars a and b are grouped, followed by the c char:
(ab)c
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      number: 1,
      expression: {
        type: 'Alternative',
        expressions: [
          {
            type: 'Char',
            value: 'a',
            symbol: 'a',
            kind: 'simple',
            codePoint: 97
          },
          {
            type: 'Char',
            value: 'b',
            symbol: 'b',
            kind: 'simple',
            codePoint: 98
          }
        ]
      }
    },
    {
      type: 'Char',
      value: 'c',
      symbol: 'c',
      kind: 'simple',
      codePoint: 99
    }
  ]
}
As we can see, it also tracks the number of the group.
Another example:
// A grouped disjunction of a symbol, and a character class:
(5|[a-z])
A capturing group can be given a name using the (?<name>...) syntax, for any identifier name.
For example, a regular expression for a date:
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u
For the group:
(?<foo>x)
We have the following node (the name property with value foo is added):
{
  type: 'Group',
  capturing: true,
  name: 'foo',
  nameRaw: 'foo',
  number: 1,
  expression: {
    type: 'Char',
    value: 'x',
    symbol: 'x',
    kind: 'simple',
    codePoint: 120
  }
}
Note: The nameRaw property represents the name as parsed from the original source, including escape sequences. The name property represents the canonical decoded form of the name.
For example, given the /u flag and the following group:
(?<\u{03C0}>x)
We would have the following node:
{
  type: 'Group',
  capturing: true,
  name: 'π',
  nameRaw: '\\u{03C0}',
  number: 1,
  expression: {
    type: 'Char',
    value: 'x',
    symbol: 'x',
    kind: 'simple',
    codePoint: 120
  }
}
Sometimes we don't need to actually capture the matched string from a group. In this case we can use a non-capturing group:
Chars a and b are grouped, but not captured, followed by the c char:
(?:ab)c
The same node, but the capturing flag is false:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: false,
      expression: {
        type: 'Alternative',
        expressions: [
          {
            type: 'Char',
            value: 'a',
            symbol: 'a',
            kind: 'simple',
            codePoint: 97
          },
          {
            type: 'Char',
            value: 'b',
            symbol: 'b',
            kind: 'simple',
            codePoint: 98
          }
        ]
      }
    },
    {
      type: 'Char',
      value: 'c',
      symbol: 'c',
      kind: 'simple',
      codePoint: 99
    }
  ]
}
A capturing group can be referenced in the pattern using the notation of an escaped group number.
Matches the abab string:
(ab)\1
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      number: 1,
      expression: {
        type: 'Alternative',
        expressions: [
          {
            type: 'Char',
            value: 'a',
            symbol: 'a',
            kind: 'simple',
            codePoint: 97
          },
          {
            type: 'Char',
            value: 'b',
            symbol: 'b',
            kind: 'simple',
            codePoint: 98
          }
        ]
      }
    },
    {
      type: 'Backreference',
      kind: 'number',
      number: 1,
      reference: 1,
    }
  ]
}
A named capturing group can be accessed using the \k<name> pattern, and also using a numbered reference.
Matches www:
(?<foo>w)\k<foo>\1
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      name: 'foo',
      nameRaw: 'foo',
      number: 1,
      expression: {
        type: 'Char',
        value: 'w',
        symbol: 'w',
        kind: 'simple',
        codePoint: 119
      }
    },
    {
      type: 'Backreference',
      kind: 'name',
      number: 1,
      reference: 'foo',
      referenceRaw: 'foo'
    },
    {
      type: 'Backreference',
      kind: 'number',
      number: 1,
      reference: 1
    }
  ]
}
Note: The referenceRaw property represents the reference as parsed from the original source, including escape sequences. The reference property represents the canonical decoded form of the reference.
For example, given the /u flag and the following pattern (matches www):
(?<π>w)\k<\u{03C0}>\1
We would have the following node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      name: 'π',
      nameRaw: 'π',
      number: 1,
      expression: {
        type: 'Char',
        value: 'w',
        symbol: 'w',
        kind: 'simple',
        codePoint: 119
      }
    },
    {
      type: 'Backreference',
      kind: 'name',
      number: 1,
      reference: 'π',
      referenceRaw: '\\u{03C0}'
    },
    {
      type: 'Backreference',
      kind: 'number',
      number: 1,
      reference: 1
    }
  ]
}
Quantifiers specify repetition of a regular expression (or of its part). Below are the quantifiers which wrap a parsed expression into a Repetition node. The quantifier itself can be of different kinds, and has the Quantifier node type.
The ? quantifier is short for {0,1}.
a?
Node:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '?',
    greedy: true
  }
}
The * quantifier is short for {0,}.
a*
Node:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '*',
    greedy: true
  }
}
The + quantifier is short for {1,}.
// Same as `aa*`, or `a{1,}`
a+
Node:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '+',
    greedy: true
  }
}
Explicit range-based quantifiers are parsed as follows:
a{3}
The kind of the quantifier is Range, and the from and to properties have the same value:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: 'Range',
    from: 3,
    to: 3,
    greedy: true
  }
}
An open range doesn't have a max value (meaning "or more", i.e. an Infinity upper bound):
a{3,}
An AST node for such a range doesn't contain the to property:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: 'Range',
    from: 3,
    greedy: true
  }
}
A closed range has an explicit max value (which syntactically can be the same as the min value):
a{3,5}
// Same as a{3}
a{3,3}
An AST node for a closed range:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: 'Range',
    from: 3,
    to: 5,
    greedy: true
  }
}
NOTE: it is a syntax error if the max value is less than the min value: /a{3,2}/
If any quantifier is followed by ?, the quantifier becomes non-greedy.
Example:
a+?
Node:
{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '+',
    greedy: false
  }
}
Other examples:
a??
a*?
a{1}?
a{1,}?
a{1,3}?
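The effect of the greedy flag is easiest to see by comparing what each variant actually consumes. A minimal sketch with the built-in RegExp (sample inputs are arbitrary):

```javascript
// Built-in RegExp only: greedy vs. non-greedy quantifiers.
const input = 'aaa';

const greedyPlus = input.match(/a+/)[0];
const lazyPlus = input.match(/a+?/)[0];
const lazyRange = input.match(/a{1,3}?/)[0];
const lazyStar = input.match(/a*?/)[0];

console.log(greedyPlus); // 'aaa', takes as much as possible
console.log(lazyPlus); // 'a', takes as little as possible
console.log(lazyRange); // 'a'
console.log(lazyStar === ''); // true, zero repetitions suffice

// A typical case where the difference matters.
const tags = '<a><b>';
const greedyTag = tags.match(/<.+>/)[0];
const lazyTag = tags.match(/<.+?>/)[0];
console.log(greedyTag); // '<a><b>'
console.log(lazyTag); // '<a>'
```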
Assertions appear as separate AST nodes; however, instead of operating on the characters themselves, they assert certain conditions of a matching string. Examples: ^ -- beginning of a string (or a line in multiline mode), $ -- end of a string, etc.
The ^ assertion checks whether a scanner is at the beginning of a string (or a line in multiline mode).
In the example below ^ is not a property of the a symbol, but a separate AST node for the assertion. The parsed node is actually an Alternative with two nodes:
^a
The node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Assertion',
      kind: '^'
    },
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    }
  ]
}
Since assertion is a separate node, it may appear anywhere in the matching string. The following regexp is completely valid, and asserts beginning of the string; it'll match an empty string:
^^^^^
The $ assertion is similar to ^, but asserts the end of a string (or a line in multiline mode):
a$
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Assertion',
      kind: '$'
    }
  ]
}
And again, this is a completely valid regexp, and matches an empty string:
^^^^$$$$$
// valid too:
$^
The \b assertion checks for a word boundary, i.e. the position between a word character and a non-word character (such as a space), or the start or end of the string next to a word character.
Matches x in x y, but not in xy:
x\b
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'x',
      symbol: 'x',
      kind: 'simple',
      codePoint: 120
    },
    {
      type: 'Assertion',
      kind: '\\b'
    }
  ]
}
The \B assertion is the opposite: it checks for a non-word boundary. The following example matches x in xy, but not in x y:
x\B
The node is the same, except for the assertion kind:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'x',
      symbol: 'x',
      kind: 'simple',
      codePoint: 120
    },
    {
      type: 'Assertion',
      kind: '\\B'
    }
  ]
}
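The two examples above can be checked with the built-in RegExp. This small sketch also shows that the end of the input counts as a word boundary:

```javascript
// Built-in RegExp only: word boundary vs. non-word boundary.
const boundary = /x\b/;
const nonBoundary = /x\B/;

console.log(boundary.test('x y')); // true
console.log(boundary.test('xy')); // false

console.log(nonBoundary.test('xy')); // true
console.log(nonBoundary.test('x y')); // false

// The end of the input and punctuation also form a boundary.
console.log(boundary.test('x')); // true
console.log(boundary.test('x-y')); // true
```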
These assertions check whether a pattern is followed (or not followed for the negative assertion) by another pattern.
Matches a only if it's followed by b:
a(?=b)
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Assertion',
      kind: 'Lookahead',
      assertion: {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98
      }
    }
  ]
}
Matches a only if it's not followed by b:
a(?!b)
The node is similar, with the negative flag added:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Assertion',
      kind: 'Lookahead',
      negative: true,
      assertion: {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98
      }
    }
  ]
}
NOTE: Lookbehind assertions were added to the language in ECMAScript 2018; older JavaScript engines may not support them.
These assertions check whether a pattern is preceded (or not preceded for the negative assertion) by another pattern.
Matches b only if it's preceded by a:
(?<=a)b
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Assertion',
      kind: 'Lookbehind',
      assertion: {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97
      }
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    },
  ]
}
Matches b only if it's not preceded by a:
(?<!a)b
A node:
{
  type: 'Alternative',
  expressions: [
    {
      type: 'Assertion',
      kind: 'Lookbehind',
      negative: true,
      assertion: {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97
      }
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    },
  ]
}
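The lookahead and lookbehind examples above can be tried with the built-in engine. This is a minimal sketch assuming a runtime that supports lookbehind; it also shows that the asserted pattern is not part of the match itself.

```javascript
// Built-in RegExp only; assumes the runtime supports lookbehind.
const followedByB = /a(?=b)/;
const notFollowedByB = /a(?!b)/;
const precededByA = /(?<=a)b/;
const notPrecededByA = /(?<!a)b/;

console.log(followedByB.test('ab'), followedByB.test('ac')); // true false
console.log(notFollowedByB.test('ac'), notFollowedByB.test('ab')); // true false
console.log(precededByA.test('ab'), precededByA.test('cb')); // true false
console.log(notPrecededByA.test('cb'), notPrecededByA.test('ab')); // true false

// Assertions consume nothing: only 'a' is part of the match.
const matched = 'ab'.match(followedByB)[0];
console.log(matched); // 'a'
```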
No vulnerabilities found.
Reason: no binaries found in the repo
Reason: license file detected
Reason: Found 4/7 approved changesets -- score normalized to 5
Reason: 0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0
Reason: no effort to earn an OpenSSF best practices badge detected
Reason: security policy file not detected
Reason: project is not fuzzed
Reason: branch protection not enabled on development/release branches
Reason: SAST tool is not run on all commits -- score normalized to 0
Reason: 11 existing vulnerabilities detected
Last Scanned on 2024-11-25
The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.