Commit 3b7e6b3

docs - add regex design
1 parent 225a5c4 commit 3b7e6b3

1 file changed

+60
-0
lines changed

dev/design/REGEX_ALTERNATIVES.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
# Open-source regex engines to look at

---

### 1. **PCRE2 (C)**

* **Closest match to Perl semantics.**
  Supports subroutine calls, recursion, backtracking control verbs, `\K`, `\G`, named captures, etc.
* **Mature and very fast.**
  Highly optimized and widely deployed in production.
* **Drawback:** written in C. You’d need either a JNI bridge (performance overhead plus complexity) or a port of its VM design to Java/bytecode.
* **Good fit if:** you want an authoritative reference for how the opcodes and backtracking model can look. Even just studying PCRE2’s IR and backtracking stack gives you a roadmap.
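
To make the "opcodes plus backtracking stack" idea concrete, here is a minimal sketch of a backtracking matcher in Java. Everything in it is an illustrative assumption: the opcode names (`CHAR`, `SPLIT`, `JMP`, `MATCH`), the `BacktrackVmSketch` class, and the hand-assembled program for `a+b`. PCRE2's real opcode set and frame layout are far richer and do not look like this.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical miniature backtracking VM; NOT PCRE2's actual opcodes. */
final class BacktrackVmSketch {
    enum Op { CHAR, SPLIT, JMP, MATCH }

    /** x: literal for CHAR, or jump target; y: SPLIT's fallback target. */
    static final class Insn {
        final Op op; final int x; final int y;
        Insn(Op op, int x, int y) { this.op = op; this.x = x; this.y = y; }
    }

    /** Hand-assembled program for the regex a+b (greedy). */
    static Insn[] aPlusB() {
        return new Insn[] {
            new Insn(Op.CHAR, 'a', 0),  // 0: match 'a'
            new Insn(Op.SPLIT, 0, 2),   // 1: prefer looping to 0, fall back to 2
            new Insn(Op.CHAR, 'b', 0),  // 2: match 'b'
            new Insn(Op.MATCH, 0, 0),   // 3: success
        };
    }

    /** True if the program matches a prefix of input, anchored at index 0. */
    static boolean matchesPrefix(Insn[] prog, String input) {
        // Each entry is a saved choice point: {pc, input position}.
        Deque<int[]> backtrack = new ArrayDeque<>();
        backtrack.push(new int[] {0, 0});
        while (!backtrack.isEmpty()) {
            int[] choice = backtrack.pop();
            int pc = choice[0], pos = choice[1];
            boolean alive = true;
            while (alive) {
                Insn in = prog[pc];
                switch (in.op) {
                    case CHAR:
                        if (pos < input.length() && input.charAt(pos) == in.x) { pc++; pos++; }
                        else alive = false;  // dead end: resume the latest choice point
                        break;
                    case SPLIT:
                        backtrack.push(new int[] {in.y, pos});  // remember the alternative
                        pc = in.x;
                        break;
                    case JMP:
                        pc = in.x;
                        break;
                    case MATCH:
                        return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesPrefix(aPlusB(), "aaab"));
        System.out.println(matchesPrefix(aPlusB(), "aac"));
    }
}
```

The explicit stack of choice points is the part worth comparing against PCRE2: Perl-style features such as backtracking control verbs would presumably be implemented as operations on that stack.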

---

### 2. **Oniguruma / Onigmo (C, Ruby’s regex)**

* **Used by Ruby.**
  Oniguruma shipped with Ruby 1.9; its fork Onigmo has been Ruby’s engine since 2.0. Similar richness to PCRE2: subroutines, conditionals, callouts, verb-like constructs.
* **Unicode-aware.**
  Strong support for multibyte encodings.
* **Same issue:** C codebase, so it would require a port or JNI.
* **Good fit if:** you want a more modular codebase than PCRE2 and are interested in multi-encoding support.

---

### 3. **RE2 (C++, Google)**

* **Non-backtracking engine.**
  Designed for safety, with matching time guaranteed to be linear in the input size.
* **Not a fit** if your goal is Perl fidelity, since RE2 *intentionally omits* Perl features like backreferences, recursion, and verbs.
* **Good fit if:** you want a “safe mode” for sandboxing untrusted regexes, but not for Perl emulation.
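
As a rough illustration of what "non-backtracking" means, the sketch below runs a small regex program in lockstep: it keeps a set of active program counters and advances all of them one input character at a time, so the work is bounded by program size times input length. This is a textbook-style simulation and not RE2's actual implementation. The `LockstepVmSketch` class, its opcodes, and the hand-assembled program for `(a+)+b` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lockstep (thread-list) matcher; NOT RE2's real code. */
final class LockstepVmSketch {
    enum Op { CHAR, SPLIT, JMP, MATCH }

    static final class Insn {
        final Op op; final int x; final int y;
        Insn(Op op, int x, int y) { this.op = op; this.x = x; this.y = y; }
    }

    /** Hand-assembled program for (a+)+b, a classic backtracking trap. */
    static Insn[] nestedPlus() {
        return new Insn[] {
            new Insn(Op.CHAR, 'a', 0),  // 0: match 'a'
            new Insn(Op.SPLIT, 0, 2),   // 1: inner + loops to 0
            new Insn(Op.SPLIT, 0, 3),   // 2: outer + loops to 0
            new Insn(Op.CHAR, 'b', 0),  // 3: match 'b'
            new Insn(Op.MATCH, 0, 0),   // 4: success
        };
    }

    /** Adds pc to the thread list, following SPLIT/JMP; seen[] blocks duplicates. */
    private static void add(Insn[] prog, List<Integer> threads, boolean[] seen, int pc) {
        if (seen[pc]) return;
        seen[pc] = true;
        Insn in = prog[pc];
        if (in.op == Op.JMP) {
            add(prog, threads, seen, in.x);
        } else if (in.op == Op.SPLIT) {
            add(prog, threads, seen, in.x);
            add(prog, threads, seen, in.y);
        } else {
            threads.add(pc);  // only CHAR and MATCH wait in the list
        }
    }

    /** True if the program matches a prefix of input, anchored at index 0. */
    static boolean matchesPrefix(Insn[] prog, String input) {
        List<Integer> current = new ArrayList<>();
        add(prog, current, new boolean[prog.length], 0);
        for (int pos = 0; ; pos++) {
            List<Integer> next = new ArrayList<>();
            boolean[] seen = new boolean[prog.length];
            for (int pc : current) {
                Insn in = prog[pc];
                if (in.op == Op.MATCH) return true;
                if (pos < input.length() && input.charAt(pos) == in.x) {
                    add(prog, next, seen, pc + 1);
                }
            }
            if (next.isEmpty()) return false;  // every thread died
            current = next;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesPrefix(nestedPlus(), "aaab"));
        System.out.println(matchesPrefix(nestedPlus(), "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
    }
}
```

Because each step holds at most one thread per instruction, a long run of `a` with no trailing `b` is rejected quickly, whereas a naive backtracking matcher may explore exponentially many paths on the same input. The price is visible too: this model has no natural place for backreferences or backtracking verbs, which is why such an engine does not suit Perl emulation.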

---

### 4. **Joni (Java, JRuby project)**

* **Oniguruma port to Java.**
  Used by JRuby to give Ruby-like regex semantics on the JVM.
* **Already runs on the JVM.**
  No JNI bridge needed; you can inspect and possibly adapt its VM design.
* **Not Perl-complete.** But it’s the closest JVM-side codebase with backtracking and features beyond `java.util.regex`.
* **Good fit if:** you want something you can drop into PerlOnJava as a starting point and then extend toward Perl’s semantics.
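
One quick way to see where `java.util.regex` stops is to ask it to compile a few Perl/PCRE-style constructs and record which ones it rejects. The sketch below does that with the standard library only; the `JavaRegexProbe` class and the particular probe patterns are choices made for this example, not part of any existing PerlOnJava code.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Probes which Perl-style constructs java.util.regex accepts syntactically. */
final class JavaRegexProbe {
    /** Returns true if Pattern.compile accepts the regex without a syntax error. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] probes = {
            "(a+)\\1",        // backreference
            "\\Gfoo",         // \G anchor
            "foo\\Kbar",      // \K (keep out)
            "\\((?R)?\\)",    // whole-pattern recursion
            "a(*FAIL)",       // backtracking control verb
            "(?{ 1 })",       // embedded code block
        };
        for (String p : probes) {
            System.out.println(p + " -> " + (compiles(p) ? "accepted" : "rejected"));
        }
    }
}
```

On current OpenJDK releases the first two probes should be accepted and the remaining four rejected with a `PatternSyntaxException`. A check like this only covers syntax; it says nothing about semantic differences in constructs that both engines accept.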

---

### 5. **PCRE-J (various Java ports)**

* Some partial PCRE ports exist in Java, though many are outdated and incomplete.
* **Good fit if:** you find a maintained one; otherwise this is riskier than Joni.

---

### Strategy for PerlOnJava

* **Short term:** leverage Joni. It’s Java, battle-tested (JRuby depends on it), and already has a backtracking model plus subroutine calls. You can embed it, then progressively extend its instruction set to cover Perl-only constructs (`(?{ })`, `(??{ })`, cut verbs).
* **Mid term:** study PCRE2 as the “gold standard” and port the missing instructions (its C opcode interpreter should translate fairly directly to a Java interpreter loop or to generated JVM bytecode).
* **Long term:** you’ll probably end up with your own forked/extended VM specialized for PerlOnJava, but you save months of design by standing on Joni/PCRE2 first.