| 1 | +# Open-source regex engines to look at |
| 2 | + |
| 3 | +--- |
| 4 | + |
| 5 | +### 1. **PCRE2 (C)** |
| 6 | + |
| 7 | +* **Closest match to Perl semantics.** |
| 8 | + Supports subroutine calls, recursion, backtracking control verbs, `\K`, `\G`, named captures, etc. |
| 9 | +* **Mature and very fast.** |
| 10 | + Highly optimized (it ships an optional JIT compiler) and widely deployed in production. |
| 11 | +* **Drawback:** written in C. You’d either need a JNI bridge (performance overhead + complexity) or port its VM design to Java/bytecode. |
| 12 | +* **Good fit if:** you want an authoritative reference for how the opcodes and backtracking model can look. Even just studying PCRE2’s IR and backtracking stack gives you a roadmap. |
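
To make the "opcodes plus backtracking stack" idea concrete, here is a minimal sketch of a backtracking matcher in Java. It is not PCRE2's or Joni's actual design: the four opcodes (`CHAR`, `SPLIT`, `JMP`, `MATCH`), the `int[]` instruction layout, and the class name `TinyBacktrackVm` are all assumptions made for illustration. The point is only that choice points live on an explicit stack rather than in host-language recursion.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy backtracking matcher; opcode names and layout are illustrative assumptions. */
class TinyBacktrackVm {
    static final int CHAR = 0, SPLIT = 1, JMP = 2, MATCH = 3;

    /** Builds one instruction as {opcode, x, y}. */
    static int[] insn(int op, int x, int y) {
        return new int[] {op, x, y};
    }

    /** Hand-assembled program for the pattern a+b, anchored at the start of the input. */
    static int[][] aPlusB() {
        return new int[][] {
            insn(CHAR, 'a', 0),  // 0: consume 'a'
            insn(SPLIT, 0, 2),   // 1: greedy loop: try 0 first, keep 2 as the fallback
            insn(CHAR, 'b', 0),  // 2: consume 'b'
            insn(MATCH, 0, 0)    // 3: success
        };
    }

    /** Returns true if the program matches a prefix of the input starting at index 0. */
    static boolean run(int[][] prog, String input) {
        Deque<int[]> stack = new ArrayDeque<>();  // saved choice points: {pc, sp}
        stack.push(new int[] {0, 0});
        while (!stack.isEmpty()) {
            int[] frame = stack.pop();            // backtrack to the newest choice point
            int pc = frame[0], sp = frame[1];
            boolean alive = true;
            while (alive) {
                int[] in = prog[pc];
                switch (in[0]) {
                    case CHAR:
                        if (sp < input.length() && input.charAt(sp) == in[1]) {
                            pc++;
                            sp++;
                        } else {
                            alive = false;        // this path failed
                        }
                        break;
                    case SPLIT:
                        stack.push(new int[] {in[2], sp});  // remember the alternative
                        pc = in[1];
                        break;
                    case JMP:
                        pc = in[1];
                        break;
                    case MATCH:
                        return true;
                    default:
                        throw new IllegalStateException("unknown opcode " + in[0]);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(run(aPlusB(), "aaab"));
        System.out.println(run(aPlusB(), "aaa"));
    }
}
```

A real engine adds capture slots, counted repeats, and guards against empty-loop iterations to this skeleton. Backtracking control verbs are the interesting extension point, since they manipulate the choice-point stack directly (for example by discarding saved frames).
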
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +### 2. **Oniguruma / Onigmo (C, Ruby’s regex)** |
| 17 | + |
| 18 | +* **Used by Ruby.** |
| 19 | + Ruby 1.9 adopted Oniguruma, and Ruby 2.0 onward uses the Onigmo fork. Similar richness: subroutines, conditionals, callouts, verb-like constructs. |
| 20 | +* **Unicode-aware.** |
| 21 | + Strong support for multibyte encodings. |
| 22 | +* **Same issue:** C codebase → would require port or JNI. |
| 23 | +* **Good fit if:** you want a more modular codebase than PCRE2 and are interested in multi-encoding support. |
| 24 | + |
| 25 | +--- |
| 26 | + |
| 27 | +### 3. **RE2 (C++, Google)** |
| 28 | + |
| 29 | +* **Non-backtracking engine.** |
| 30 | + Designed for safety, with matching time guaranteed to be linear in the input length. |
| 31 | +* **Not a fit** if your goal is Perl fidelity, since RE2 *intentionally omits* Perl features such as backreferences, lookaround, recursion, and verbs. |
| 32 | +* **Good fit if:** you want a “safe mode” for running untrusted patterns, but not for Perl emulation. |
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +### 4. **Joni (Java, JRuby project)** |
| 37 | + |
| 38 | +* **Oniguruma port to Java.** |
| 39 | + Used by JRuby to give Ruby-like regex semantics on the JVM. |
| 40 | +* **Already integrates with JVM.** |
| 41 | + No JNI bridge needed; you can inspect and possibly adapt its VM design. |
| 42 | +* **Not Perl-complete.** But it’s the closest JVM-side codebase with backtracking and features beyond `java.util.regex`. |
| 43 | +* **Good fit if:** you want something you can drop into PerlOnJava as a starting point and then extend toward Perl’s semantics. |
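
To see the gap any replacement engine has to close, it can help to probe which Perl constructs the stock `java.util.regex` compiler accepts. The sketch below only calls `Pattern.compile` and catches `PatternSyntaxException`. The class name `PerlFeatureProbe` and the sample patterns are made up for illustration, and results may vary between JDK versions, so treat the output as something to check locally rather than a specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Probes which Perl-style constructs java.util.regex will compile (illustrative only). */
class PerlFeatureProbe {

    /** Returns true if java.util.regex accepts the pattern's syntax. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /** Sample patterns, one per construct; the selection is an assumption, not a full list. */
    static Map<String, String> samples() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("named capture", "(?<year>\\d{4})");
        m.put("backreference", "(a)\\1");
        m.put("\\G anchor", "\\G\\d");
        m.put("\\K keep-out", "foo\\Kbar");
        m.put("recursion (?R)", "a(?R)?b");
        m.put("subroutine call (?1)", "(a)(?1)");
        m.put("verb (*FAIL)", "a(*FAIL)");
        m.put("code block (?{ })", "a(?{ 1 })");
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : samples().entrySet()) {
            String verdict = compiles(e.getValue()) ? "accepted" : "rejected";
            System.out.println(e.getKey() + ": " + verdict);
        }
    }
}
```

The rejected entries are a rough checklist of what a Joni-based or custom engine would need to add. A pattern that compiles is not necessarily Perl-compatible, though, because a construct can be accepted and still behave differently.
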
| 44 | + |
| 45 | +--- |
| 46 | + |
| 47 | +### 5. **PCRE-J (various Java ports)** |
| 48 | + |
| 49 | +* Some partial PCRE ports exist in Java, though many are outdated and incomplete. |
| 50 | +* **Good fit if:** you find a maintained one; otherwise these ports are riskier than Joni. |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +### Strategy for PerlOnJava |
| 55 | + |
| 56 | +* **Short term:** leverage Joni. It’s Java, battle-tested (JRuby depends on it), and already has a backtracking model + subroutines. You can embed it, then progressively extend its instruction set to cover the Perl constructs it lacks (`(?{ })`, `(??{ })`, and backtracking control verbs such as `(*PRUNE)` and `(*COMMIT)`). |
| 57 | +* **Mid term:** study PCRE2 as the “gold standard” and port the missing instructions. Its interpreter is an opcode-dispatch loop over a compiled pattern with an explicit backtracking frame stack, a design that carries over fairly directly to a Java interpreter loop (rather than to JVM bytecode as such). |
| 58 | +* **Long term:** you’ll probably end up with your own forked/extended VM specialized for PerlOnJava, but starting from Joni and PCRE2 should spare you much of the up-front design work. |
| 59 | + |
| 60 | + |