Does this m4 implementation work with UTF-8 files? #3

cesss · 2020-08-28T21:30:35Z

Hi!

I want to use m4 as a preprocessor for some C source files, but these files contain some UTF-8 characters (in C comments and in string literals). Is this m4 implementation ready to process this kind of file, or should I expect problems? (Note: I'm looking for a non-copyleft m4 implementation.)

Thanks a lot!

ibara · 2020-08-30T14:29:44Z

I am not actually sure. Many OpenBSD utilities accept UTF-8, but I've never used m4 with non-ASCII characters.
I would be interested if you could give it a try and report back.

cesss · 2020-08-31T09:18:32Z

I suspect it's not ready for UTF-8, because the key point is moving from one character to the next. In ASCII, that move is always 1 byte. In UTF-8 it can be 1, 2, 3, or 4 bytes, depending on the value of the first byte. I haven't found this distinction anywhere in the code that moves across characters.
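To make the "1, 2, 3, or 4 bytes" rule concrete, here is a minimal Python sketch. It is not taken from the m4 source; the names `utf8_seq_len` and `split_chars` are made up for illustration, and the splitter assumes the input is already valid UTF-8.

```python
def utf8_seq_len(lead):
    """Return how many bytes the UTF-8 sequence starting with `lead` occupies."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: 2-byte sequence
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: 3-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: 4-byte sequence
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF never appear.
    raise ValueError("not a valid UTF-8 lead byte: 0x%02X" % lead)


def split_chars(data):
    """Split UTF-8 bytes into one bytes object per character.

    Hypothetical helper: it trusts the lead byte and does not validate
    the continuation bytes that follow it.
    """
    chars = []
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


print(split_chars("a\u00f1\u20ac".encode("utf-8")))
```

A byte-oriented tool advances by one in that loop instead of by `n`, which is the distinction I was looking for in the code.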

Now, most of the time it will work correctly, but it will break if any of the bytes that form a multibyte UTF-8 character matches, for example, a quoting character, a parenthesis, or any other token that confuses m4.

Another point is how m4 would tell whether the input is ASCII or UTF-8. If it looks at the current locale, it could break weird or obfuscated m4 files that happen to contain unprintable byte values that map to UTF-8 multibyte characters (I have no idea if anybody uses such weird m4 files, but it could happen). On the other hand, choosing the input encoding with a command line argument instead of the locale might be considered inelegant (although, IIRC, some compilers have such a flag).

For the moment, I have managed to isolate my m4 files so that they are ASCII clean. But one of the features I wish to add in my m4 macros implies processing C sources that can (and will) have UTF-8 characters.

Now, if the "move to next character" is the only point where m4 can break, and if that's done in only one place in the code, patching it could be straightforward. But if the code assumes in several places that this can be done by looking at the next byte, the patch would be more complicated.

IMHO, it's a pity that m4 has been used so little, because it has the functionality you miss in the C preprocessor whenever you are doing metaprogramming and you don't want to (or can't) use C++. Had m4 been used more, it would probably support UTF-8 fully by now, because current compilers use UTF-8 as the default encoding for source input.

cesss · 2020-08-31T11:09:50Z

Oops, disregard my last comment: for a moment I forgot that all the bytes of a multibyte UTF-8 character have the high bit set (lead bytes have the form 11xxxxxx and continuation bytes the form 10xxxxxx), so they are all above the US-ASCII range and no clash should occur. So, in theory, if the input is UTF-8 but your m4 macros are written in US-ASCII, everything should work fine. Using UTF-8 in the macros themselves is a different matter... I'm not sure that would work, but I don't need it either.
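As a quick sanity check of that reasoning, here is a small Python sketch. The helper names and the sample string are made up for illustration, and `M4_SPECIAL` assumes the default m4 syntax characters (`` ` `` and `'` as quotes, `#` for comments, plus parentheses, comma, and `$`).

```python
# Assumed default m4 syntax characters, all of them in the ASCII range.
M4_SPECIAL = frozenset(b"`'(),#$")


def bytes_of_multibyte_chars(text):
    """Collect the UTF-8 bytes of every non-ASCII character in `text`."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:   # only characters that need 2 to 4 bytes
            out += encoded
    return bytes(out)


def special_bytes_inside_multibyte(text):
    """Return bytes of multibyte characters that look like m4 syntax."""
    return [b for b in bytes_of_multibyte_chars(text) if b in M4_SPECIAL]


# Hypothetical C comment mixing ASCII punctuation with UTF-8 text.
sample = "/* señor, 'café' (日本語) € */"

# Every byte of a multibyte character has the high bit set ...
print(all(b >= 0x80 for b in bytes_of_multibyte_chars(sample)))
# ... so none of them can be mistaken for a quote, paren, comma, etc.
print(special_bytes_inside_multibyte(sample))
```

The ASCII quotes and parentheses in the sample are still seen by m4, of course, but they are genuine single-byte characters, not fragments of a multibyte one.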

I'll give it a try and I'll report back.
