Does this m4 implementation work with UTF-8 files? #3
Comments
I am not actually sure. Many OpenBSD utilities accept UTF-8, but I've never used m4 with non-ASCII characters.
I suspect it's not ready for UTF-8, because the key point is moving from one character to the next. In ASCII, that move is always 1 byte. In UTF-8 it can be 1, 2, 3, or 4 bytes, depending on the byte values. I haven't found this distinction anywhere in the code. Most of the time it will work correctly, but it will break if any of the bytes that form a multibyte UTF-8 character matches, for example, a quoting character, a parenthesis, or any other token that confuses m4.

Another point: how would m4 tell whether the input is ASCII or UTF-8? If it looks at the current locale, it could break weird or obfuscated m4 files that happen to contain unprintable byte values that map to UTF-8 multibyte characters (I have no idea whether anybody uses such weird m4 files, but it could happen). On the other hand, choosing the input encoding with a command-line argument instead of the locale might be considered an inelegant solution (although, IIRC, some compilers have such a flag).

For the moment, I have managed to isolate my m4 files so that they are ASCII-clean. But one of the features I want to add to my m4 macros involves processing C sources that can (and will) contain UTF-8 characters. If "move to the next character" is the only point where m4 can break, and it's done in only one place in the code, patching it could be straightforward. But if the code assumes in several places that this can be done by looking at the next byte, the patch would be more complicated.

IMHO, it's a pity that m4 has been used so little, because it has the functionality you miss in the C preprocessor whenever you are doing metaprogramming and you don't want to (or can't) use C++. Had m4 been used more, it would fully support UTF-8 by now, because compilers nowadays use UTF-8 by default for input sources.
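To make the "move to the next character" step concrete, here is a minimal sketch of how the length of a UTF-8 sequence can be read from its first byte. It is written in Python only for brevity and is not taken from the m4 sources; `utf8_seq_len` and `split_chars` are hypothetical names, and the sketch does not validate continuation bytes.

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`.

    Returns 0 if `lead` cannot start a sequence (it is a continuation
    byte or a value that never appears as a lead byte in valid UTF-8).
    """
    if lead < 0x80:             # 0xxxxxxx: plain ASCII
        return 1
    if 0xC2 <= lead <= 0xDF:    # 110xxxxx (0xC0 and 0xC1 would be overlong)
        return 2
    if 0xE0 <= lead <= 0xEF:    # 1110xxxx
        return 3
    if 0xF0 <= lead <= 0xF4:    # 11110xxx, limited to U+10FFFF
        return 4
    return 0


def split_chars(data: bytes) -> list:
    """Split a byte string into one bytes object per UTF-8 character.

    Hypothetical helper: it only trusts the lead byte and does not check
    that the following bytes really are continuation bytes.
    """
    chars = []
    i = 0
    while i < len(data):
        n = utf8_seq_len(data[i])
        if n == 0 or i + n > len(data):
            raise ValueError(f"invalid UTF-8 at byte offset {i}")
        chars.append(data[i:i + n])
        i += n
    return chars
```

For example, `split_chars("aé".encode("utf-8"))` yields one 1-byte and one 2-byte element, whereas a byte-at-a-time scanner would see three separate units.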
Oops, disregard this last comment: for a moment I forgot that every byte of a multibyte UTF-8 character has its high bit set (lead bytes have the form 11xxxxxx, continuation bytes 10xxxxxx), so they are all above the US-ASCII range and no clash should occur. So, in theory, if the input is UTF-8 but your m4 macros are written in US-ASCII, everything should work fine. Using UTF-8 in the macros themselves is a different matter... I'm not sure whether that would work, but I don't need it either. I'll give it a try and report back.
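A quick way to convince yourself of the "no clash" property is to encode every non-ASCII code point and check that none of the resulting bytes falls in the ASCII range (below 0x80), which is where m4's default quotes, parentheses, commas, and comment character live. This is only a sketch in Python with hypothetical helper names; it says nothing about how any particular m4 implementation scans its input.

```python
def all_bytes_high(code_point: int) -> bool:
    """True if every byte of the UTF-8 encoding of code_point is >= 0x80."""
    return all(b >= 0x80 for b in chr(code_point).encode("utf-8"))


def non_ascii_code_points():
    """Yield every encodable code point above ASCII (surrogates excluded)."""
    for cp in range(0x80, 0x110000):
        if not 0xD800 <= cp <= 0xDFFF:
            yield cp


def bit_patterns(ch: str) -> list:
    """Binary form of each byte in the UTF-8 encoding of ch."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


if __name__ == "__main__":
    # Lead byte starts with 11, continuation byte starts with 10.
    print(bit_patterns("\u00e9"))
    # Exhaustive check over all non-ASCII code points.
    print(all(all_bytes_high(cp) for cp in non_ascii_code_points()))
```

Since no byte of a multibyte character can equal an ASCII byte, a byte-oriented scanner whose special tokens are all ASCII should pass such characters through untouched.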
Hi!
I want to use m4 as a preprocessor for some C source files, but these files contain some UTF-8 characters (in C comments and in string literals). Is this m4 implementation ready to process files like these, or should I expect problems? (Note: I'm looking for a non-copyleft m4 implementation.)
Thanks a lot!