28 changes: 17 additions & 11 deletions doc/ref/chap-expr-lang.md
@@ -774,24 +774,30 @@ Negation always uses `!`

### re-chars

Oils usually invokes `libc` in UTF-8 mode. In this mode, the regex engine
can't match bytes like `0xFF`; it can only match code points.
Oils usually invokes the `libc` POSIX extended regex (ERE) engine in UTF-8
mode, which generally matches by code point.

var x = / [ \y7F \u{3bc} ] / # a byte and a code point

Oils translates Eggex to POSIX extended regex (ERE) syntax. Here are some
restrictions when translating bytes and code points to ERE:
This mode is backwards compatible only with the 7-bit ASCII character byte
codes 1-127:

- The `NUL` byte `\y00` isn't allowed.
- The `NUL` byte `\y00` isn't allowed in ERE.
- Its synonym, code point zero `\u{0}`, also isn't allowed.
- Bytes `\y80` to `\yFF` aren't allowed, because they're outside the ASCII
range.
- Bytes `\y80` to `\yFF` aren't allowed, because they're used to encode
code points in UTF-8.

Reserving that byte range lets the engine match the many characters that
UTF-8 encodes as variable-width byte sequences.

var x = / [ \y7F \u{3bc} ] / # a byte and a code point


Reminders:

- In the ASCII range, bytes and code points are the same
- UTF-8 encodes code points 0-127 with the same single-byte values as ASCII,
  so in that range bytes and code points are the same.
- That is, `\y01` to `\y7F` are synonyms for `\u{1}` to `\u{7F}`.
- Outside of the ASCII range, they are different, so Eggex disallows them.
- Outside that range, legacy single-byte encodings differed from one another,
  so UTF-8 does not treat those bytes as characters; it uses them to encode
  globally unique code points.
- For example, `\u{FF}` is a code point, and `\yFF` is a byte, but they are
not the same.
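
The byte/code-point distinction above can be checked with a small sketch. It uses Python's built-in UTF-8 codec only as a stand-in for the encoding rules; it is not how Oils or `libc` implement matching, and the helper names `utf8_bytes` and `is_ascii_byte` are hypothetical.

```python
# Illustration only: Python's UTF-8 codec stands in for the encoding rules
# described above. The helper names are hypothetical, not part of Oils.

def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single code point."""
    return chr(code_point).encode("utf-8")


def is_ascii_byte(value: int) -> bool:
    """True for byte values 0x01-0x7F, the range allowed as a byte escape."""
    return 0x01 <= value <= 0x7F


# In the ASCII range, a byte and a code point share one encoding.
print(utf8_bytes(0x7F).hex())   # 7f

# Outside it, a code point is encoded with bytes taken from 0x80-0xFF.
print(utf8_bytes(0x3BC).hex())  # cebc (U+03BC, Greek small letter mu)
print(utf8_bytes(0xFF).hex())   # c3bf (U+00FF is two bytes, not byte 0xFF)
```

The last line shows why `\u{FF}` and `\yFF` differ: the code point U+00FF is the two-byte sequence `C3 BF` in UTF-8, while the single byte `0xFF` never appears in valid UTF-8.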
