A library for finite automata and regular expressions in the context of JS RegExp
npm install refa
Repository: 22 stars · 502 commits · 3 forks · 2 watching · 3 branches · 4 contributors
Updated on 23 Nov 2024
Languages: TypeScript (99.79%), JavaScript (0.21%)
Downloads:
Last day:   124,015    (-6.6% vs previous day)
Last week:  665,704    (+5% vs previous week)
Last month: 2,650,010  (+26.4% vs previous month)
Last year:  14,941,629 (+232.5% vs previous year)
A library for regular expressions (RE) and finite automata (FA) in the context of JavaScript RegExp.

refa is a general library for DFAs, NFAs, and REs of formal regular languages. It also includes methods to easily convert from JS RegExp to its internal RE AST and vice versa.
Get refa from NPM:
npm i --save refa
or
yarn add refa
Features:

- Conversions
- DFA, NFA, and ENFA operations
- DFA-specific operations
- NFA- and ENFA-specific operations
- AST transformations
- JavaScript RegExp parsing and serialization
See the API documentation for a complete list of all currently implemented operations.
refa uses its own AST format to represent regular expressions. The RE AST format is language agnostic and relatively simple.
It supports:

- concatenation (e.g. ab)
- alternation (e.g. a|b)
- quantifiers (e.g. a{4,6}, a{2,}?, a?, a*)
- assertions (e.g. (?=a), (?<!a))

Some features like atomic groups and capturing groups are not supported (but might be added in the future).
For information on how to parse JS RegExp and convert RE AST to JS RegExp, see the JS namespace.
refa does not use JavaScript strings to represent characters or sequences of characters. Instead, it uses integers to represent characters (see the Char type) and arrays of numbers to represent words/strings (see the Word type). This means that any text encoding can be used.

The Words namespace contains functions to convert JavaScript data into refa-compatible words and characters. For sets of characters, the CharSet class is used.
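To illustrate the idea (this is a sketch, not refa's actual implementation), a UTF-16 word in refa's sense is simply the array of UTF-16 code unit values of a string. The helper below shows what a conversion like Words.fromStringToUTF16 conceptually does:

```typescript
// Sketch only: refa's real conversion functions are more involved, but the
// core idea is mapping each UTF-16 code unit to an integer Char.
function toWord(s: string): number[] {
    const word: number[] = [];
    for (let i = 0; i < s.length; i++) {
        word.push(s.charCodeAt(i)); // each character becomes an integer
    }
    return word;
}

console.log(toWord("abc")); // => [ 97, 98, 99 ]
```

Because words are plain number arrays, the same machinery works regardless of whether the numbers came from UTF-16 code units, Unicode code points, or bytes.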
This library will never be able to support some modern features of regex engines, such as backreferences and recursion, because these features generally cannot be represented by a DFA or NFA.

refa is a relatively low-level library. It only provides the basic building blocks. In the following examples, JS RegExps are used a lot, so we will define a few useful helper functions beforehand.
```ts
import { DFA, FiniteAutomaton, JS, NFA } from "refa";

function toNFA(regex: RegExp): NFA {
    const { expression, maxCharacter } = JS.Parser.fromLiteral(regex).parse();
    return NFA.fromRegex(expression, { maxCharacter });
}
function toDFA(regex: RegExp): DFA {
    return DFA.fromFA(toNFA(regex));
}
function toRegExp(fa: FiniteAutomaton): RegExp {
    const literal = JS.toLiteral(fa.toRegex());
    return new RegExp(literal.source, literal.flags);
}
```
- toNFA parses the given RegExp and constructs a new NFA from the parsed AST.
- toDFA constructs a new NFA from the RegExp first and then converts that NFA into a new DFA.
- toRegExp takes an FA (= NFA or DFA) and converts it into a RegExp.

```ts
import { Words } from "refa";

const regex = /\w+\d+/;
const nfa = toNFA(regex);

console.log(nfa.test(Words.fromStringToUTF16("abc")));
// => false
console.log(nfa.test(Words.fromStringToUTF16("123")));
// => true
console.log(nfa.test(Words.fromStringToUTF16("abc123")));
// => true
console.log(nfa.test(Words.fromStringToUTF16("123abc")));
// => false
```
```ts
const regex1 = /a+B+c+/i;
const regex2 = /Ab*C\d?/;

const intersection = NFA.fromIntersection(toNFA(regex1), toNFA(regex2));

console.log(toRegExp(intersection));
// => /Ab+C/
```
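As a quick sanity check using plain RegExp (not refa's API), every string in the intersection's language must be accepted by both input patterns. The anchored variants below are an assumption made for whole-string matching:

```typescript
// Anchored plain-RegExp versions of the patterns above (assumption):
// a string the intersection accepts must be accepted by both inputs.
const regex1 = /^a+B+c+$/i;
const regex2 = /^Ab*C\d?$/;
const intersection = /^Ab+C$/;

for (const s of ["AbC", "AbbC", "abc", "AC", "AbC5"]) {
    const inBoth = regex1.test(s) && regex2.test(s);
    // The two sides agree for every test string.
    console.log(s, intersection.test(s) === inBoth);
}
```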
```ts
const regex = /a+b*/i;

const dfa = toDFA(regex);
dfa.complement();

console.log(toRegExp(dfa));
// => /(?:(?:[^A]|A+(?:[^AB]|B+[^B]))[^]*)?/i
```
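The defining property of complementation is that the complemented DFA accepts exactly the words the original rejects. This can be checked with plain RegExp by anchoring both patterns (the anchors are an assumption for whole-string matching):

```typescript
// Complement property: for every string, exactly one of the two anchored
// patterns matches (plain-RegExp sketch, not refa API).
const original = /^a+b*$/i;
const complemented = /^(?:(?:[^A]|A+(?:[^AB]|B+[^B]))[^]*)?$/i;

for (const s of ["", "a", "ab", "aab", "ba", "abc", "x"]) {
    // The results are always opposite.
    console.log(JSON.stringify(s), original.test(s) !== complemented.test(s));
}
```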
In the above examples, we have been using the toNFA helper function to parse and convert RegExps. This function assumes that the given RegExp is a pure regular expression without assertions and backreferences and will throw an error if that assumption is not met.

However, the JS parser and NFA.fromRegex provide some options to work around and even solve this problem.

Firstly, the parser will automatically resolve simple backreferences. Even toNFA will do this since it's on by default:
```ts
console.log(toRegExp(toNFA(/("|').*?\1/)));
// => /".*"|'.*'/i
```
But it will throw an error for non-trivial backreferences that cannot be resolved:
```ts
toNFA(/(#+).*\1|foo/);
// Error: Backreferences are not supported.
```
The only way to parse the RegExp despite unresolvable backreferences is to remove the backreferences. This means that the result will be imperfect but it might still be useful.
```ts
const regex = /(#+).*\1|foo/;
const { expression } =
    JS.Parser.fromLiteral(regex).parse({ backreferences: "disable" });

console.log(JS.toLiteral(expression));
// => { source: 'foo', flags: '' }
```
Note that the foo alternative is kept because it is completely unaffected by the unresolvable backreference.
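Conceptually, resolving a simple backreference means expanding the capturing group's alternatives: one copy of the pattern is made per alternative, with the backreference replaced by that alternative's text. The plain-RegExp sketch below illustrates the equivalence (the anchors are an assumption for whole-string matching):

```typescript
// The resolved pattern is the union of one copy per quote character.
const withBackref = /^("|').*?\1$/;
const resolved = /^(?:".*"|'.*')$/;

for (const s of [`"abc"`, `'abc'`, `"abc'`, `abc`, `''`, `""`]) {
    // Both patterns accept and reject exactly the same strings.
    console.log(s, withBackref.test(s) === resolved.test(s));
}
```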
While the parser and AST format can handle assertions, the NFA construction cannot.
```ts
const regex = /\b(?!\d)\w+\b|->/;
const { expression, maxCharacter } = JS.Parser.fromLiteral(regex).parse();

console.log(JS.toLiteral(expression));
// => { source: '\\b(?!\\d)\\w+\\b|->', flags: 'i' }

NFA.fromRegex(expression, { maxCharacter });
// Error: Assertions are not supported yet.
```
Similarly to backreferences, we can let the parser remove them:
```ts
const regex = /\b(?!\d)\w+\b|->/;
const { expression, maxCharacter } =
    JS.Parser.fromLiteral(regex).parse({ assertions: "disable" });

console.log(JS.toLiteral(expression));
// => { source: '->', flags: 'i' }

const nfa = NFA.fromRegex(expression, { maxCharacter });
console.log(toRegExp(nfa));
// => /->/i
```
Or we can let the NFA construction method remove them:
```ts
const regex = /\b(?!\d)\w+\b|->/;
const { expression, maxCharacter } = JS.Parser.fromLiteral(regex).parse();

console.log(JS.toLiteral(expression));
// => { source: '\\b(?!\\d)\\w+\\b|->', flags: 'i' }

const nfa = NFA.fromRegex(expression, { maxCharacter }, { assertions: "disable" });
console.log(toRegExp(nfa));
// => /->/i
```
Prefer using the parser to remove assertions if possible. The parser is quite clever and will optimize based on the fact that assertions can be removed, resulting in faster parse times.
However, simply removing assertions is not ideal since they are a lot more common than backreferences. To work around this, refa has AST transformers. AST transformers can make changes to a given AST. While each transformer is rather simple, they can also work together to accomplish more complex tasks. Applying and removing assertions is one such task.
The simplest transformer to remove assertions (among other things) is the simplify transformer. It will inline expressions, remove dead branches, apply/remove assertions, optimize quantifiers, and more.
```ts
import { JS, NFA, Transformers, transform } from "refa";

const regex = /\b(?!\d)\w+\b|->/;
const { expression, maxCharacter } = JS.Parser.fromLiteral(regex).parse();
console.log(JS.toLiteral(expression));
// => { source: '\\b(?!\\d)\\w+\\b|->', flags: '' }

const modifiedExpression = transform(Transformers.simplify(), expression);
console.log(JS.toLiteral(modifiedExpression));
// => { source: '(?<!\\w)[A-Z_]\\w*(?!\\w)|->', flags: 'i' }

// Most assertions have been removed but the patterns are still equivalent.
// The only assertions left assert characters beyond the edge of the pattern.
// Removing those assertions is easy but slightly changes the pattern.

const finalExpression = transform(Transformers.patternEdgeAssertions({ remove: true }), modifiedExpression);
console.log(JS.toLiteral(finalExpression));
// => { source: '[A-Z_]\\w*|->', flags: 'i' }

const nfa = NFA.fromRegex(finalExpression, { maxCharacter });
console.log(JS.toLiteral(nfa.toRegex()));
// => { source: '->|[A-Z_]\\w*', flags: 'i' }
```
AST transformers can handle a lot of assertions, but there are limitations. Transformers cannot handle assertions that are too complex or require large-scale changes to the AST. Let's take a look at a few examples:
```ts
import { JS, Transformers, transform } from "refa";

function simplify(regex: RegExp): void {
    const { expression } = JS.Parser.fromLiteral(regex).parse();

    const simplifiedExpression = transform(Transformers.simplify(), expression);

    const literal = JS.toLiteral(simplifiedExpression);
    console.log(new RegExp(literal.source, literal.flags));
}

simplify(/\b(?!\d)\b\w+\b\s*\(/);
// => /(?<!\w)[A-Z_]\w*\s*\(/i
simplify(/(?:^|@)\b\w+\b/);
// => /(?:^|@)\w+(?!\w)/
simplify(/"""(?:(?!""").)*"""/s);
// => /"""(?:"{0,2}[^"])*"""/
simplify(/"""((?!""")(?:[^\\]|\\"))*"""/);
// => /"""(?:"{0,2}(?:[^"\\]|\\"))*"""/
simplify(/<title>(?:(?!<\/title>).)*<\/title>/s);
// => /<title>(?:[^<]|<+(?:[^/<]|\/(?!title>)))*<+\/title>/
simplify(/^```$.*?^```$/ms);
// => /^```[\n\r\u2028\u2029](?:[^]*?[\n\r\u2028\u2029])??```$/m
```
Transformers.simplify is very aggressive when it comes to assertions. It will try to remove assertions whenever possible, even if it means that the overall AST will become more complex (within some limits). This may result in longer/more complex regexes, but it also allows NFA and ENFA to support many more regexes.
OpenSSF Scorecard (last scanned on 2024-11-25)

No vulnerabilities found.

- no dangerous workflow patterns detected
- license file detected
- no binaries found in the repo
- dependency not pinned by hash detected -- score normalized to 5
- 7 existing vulnerabilities detected
- 0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0
- Found 0/29 approved changesets -- score normalized to 0
- no effort to earn an OpenSSF best practices badge detected
- security policy file not detected
- detected GitHub workflow tokens with excessive permissions
- SAST tool is not run on all commits -- score normalized to 0
- project is not fuzzed

The Open Source Security Foundation is a cross-industry collaboration to improve the security of open source software (OSS). The Scorecard provides security health metrics for open source projects.