-
-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly escape string literals with non-ascii characters #952
Conversation
I'm not sure what is causing the test failures here. Those models compile for me locally, and in particular none of the existing cpp output is changed by this PR |
That is unrelated to this PR, the error happens on master too. See #944 (comment) Its been fixed but it needs a few more hours for the Math changes to propagate up to Cmdstan. Sorry about that. |
Ah, I'm just glad to hear it wasn't caused by this - that would have been deeply confusing |
Indirectly related to this PR, C++ source code is allowed to be unicode, though I think this would play weirdly with our upstream service API users via R Python etc. |
I think supporting it in source code would be a much bigger endeavor. Part of the reason this change was so simple is we are actually already correctly lexing/tokenizing these strings, so all I needed to do was make them output in the way C++ expects. Plus, strings are such a minor part of the language that it is ultimately a low-impact change. If we wanted them as identifiers in the language I think we would need to use a different lexing library like sedlex which actually supports unicode fully |
Yeah totally out of the range of this PR, just an interesting thought |
The main attraction of Unicode characters in Stan programs would be to allow the following.
If we allowed that, the Julia devs might stop belittling Stan for only supporting ASCII. 😄 Of course, as soon as I threw down the emoji, I realized what we're likely to get is this:
|
@rok-cesnovar - any idea when the stan-math fix will definitely have propagated? I'm still seeing errors on those two models |
Hey, as soon as this run in stan-dev/stan finishes: https://jenkins.mc-stan.org/blue/organizations/jenkins/Stan/detail/develop/932/ It should finish in about an hour (barring anything unexpected). We had a bit of a backlog of tests yesterday so this took a bit longer… |
Finished now, I restarted the tests here. |
Tests seem good. If we actually want this feature I will quickly write up a doc change - any opinions? |
Yes, we'd definitely like to be able to do the right thing with unicode print and reject statements. The doc just needs to revise the character encoding discussion in the reference manual. And, of course, I'll write a hello emoji blog post to which everyone will ask if we can use unicode identifiers. |
If you want a brief, non-technical answer to that request: This PR is so simple because it only ensures we escape non-ascii characters in a way C++ understands, thereby enabling the user to use any encoding that their editor and terminal support. For us to use said characters outside of strings, we would need to actually care about the encoding in a hands-on way. |
I think we'll just take any old byte stream in the input and preserve it. If it happens to correspond to UTF-8 characters and you have a console, etc. that will render that, then great. |
I've updated the doc PR to more or less say that and avoid the question of 'what is a character' |
Docs have been merged so this should be good to go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but could you add a test file test/integration/good/code-gen/print_unicode.stan
with a couple of examples.
Done! Here is what I added: transformed data {
print("test: Љ😃");
print("λ β ζ π");
} (file encoded in UTF-8) and the expected output segment: current_statement__ = 1;
if (pstream__) {
stan_print(pstream__, "test: \320\211\360\237\230\203");
stan_print(pstream__, "\n");
}
current_statement__ = 2;
if (pstream__) {
stan_print(pstream__, "\316\273 \316\262 \316\266 \317\200");
stan_print(pstream__, "\n");
} You can check that those are the correct escape sequences, and running the program prints as expected on my UTF-8 supporting terminal |
Currently, string literals are escaped with
%S
formatting, which turns non-printing characters into a 3-long escape sequence like\012
. However, this escaping is in decimal, while C++ expects an escape sequence\123
to be in octal. By copying most of the ocaml stdlib's escaping code, but replacing the relevant section to output octal sequences instead of decimal, string literals will now properly pass to the C++ code.In theory this allows you to write print/reject statements in non-English character sets, in practice I only tested it on some emojis and Cyrillic characters. The existing behavior led to some very strange results being printed, but after this change it printed as expected (on a UTF-supporting terminal, anyway)
This was mostly a joke between @bob-carpenter and I -- he said if I added emojis to stan, it would get a blog post. But, it is a relatively simple change and doesn't require any upkeep/support, because strings so limited in their use.
Submission Checklist
Release notes
Provide rough support for non-ASCII characters in string literals.
Copyright and Licensing
By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)