Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly escape string literals with non-ascii characters #952

Merged
merged 4 commits into from
Sep 10, 2021

Conversation

WardBrian
Copy link
Member

@WardBrian WardBrian commented Aug 24, 2021

Currently, string literals are escaped with %S formatting, which turns non-printing characters into a 3-long escape sequence like \012. However, this escaping is in decimal, while C++ expects an escape sequence \123 to be in octal. By copying most of the ocaml stdlib's escaping code, but replacing the relevant section to output octal sequences instead of decimal, string literals will now properly pass to the C++ code.

In theory this allows you to write print/reject statements in non-English character sets, in practice I only tested it on some emojis and Cyrillic characters. The existing behavior led to some very strange results being printed, but after this change it printed as expected (on a UTF-supporting terminal, anyway)

This was mostly a joke between @bob-carpenter and I -- he said if I added emojis to stan, it would get a blog post. But, it is a relatively simple change and doesn't require any upkeep/support, because strings so limited in their use.

Submission Checklist

Release notes

Provide rough support for non-ASCII characters in string literals.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@WardBrian
Copy link
Member Author

I'm not sure what is causing the test failures here. Those models compile for me locally, and in particular none of the existing cpp output is changed by this PR

@rok-cesnovar
Copy link
Member

That is unrelated to this PR, the error happens on master too. See #944 (comment)

Its been fixed but it needs a few more hours for the Math changes to propagate up to Cmdstan. Sorry about that.

@WardBrian
Copy link
Member Author

Ah, I'm just glad to hear it wasn't caused by this - that would have been deeply confusing

@SteveBronder
Copy link
Contributor

Indirectly related to this PR, C++ source code is allowed to be unicode, though I think this would play weirdly with our upstream service API users via R Python etc.

https://godbolt.org/z/jK6nq57rc

@WardBrian
Copy link
Member Author

I think supporting it in source code would be a much bigger endeavor. Part of the reason this change was so simple is we are actually already correctly lexing/tokenizing these strings, so all I needed to do was make them output in the way C++ expects. Plus, strings are such a minor part of the language that it is ultimately a low-impact change.

If we wanted them as identifiers in the language I think we would need to use a different lexing library like sedlex which actually supports unicode fully

@SteveBronder
Copy link
Contributor

Yeah totally out of the range of this PR, just an interesting thought

@bob-carpenter
Copy link
Member

The main attraction of Unicode characters in Stan programs would be to allow the following.

β ~ normal(0, σ);

If we allowed that, the Julia devs might stop belittling Stan for only supporting ASCII. 😄 Of course, as soon as I threw down the emoji, I realized what we're likely to get is this:

🙂 ~ ☺️_lpdf(😄);

@WardBrian
Copy link
Member Author

@rok-cesnovar - any idea when the stan-math fix will definitely have propagated? I'm still seeing errors on those two models

@rok-cesnovar
Copy link
Member

rok-cesnovar commented Aug 26, 2021

Hey, as soon as this run in stan-dev/stan finishes: https://jenkins.mc-stan.org/blue/organizations/jenkins/Stan/detail/develop/932/

It should finish in about an hour (barring anything unexpected). We had a bit of a backlog of tests yesterday so this took a bit longer…

@rok-cesnovar
Copy link
Member

Finished now, I restarted the tests here.

@WardBrian
Copy link
Member Author

Tests seem good. If we actually want this feature I will quickly write up a doc change - any opinions?

@bob-carpenter
Copy link
Member

Yes, we'd definitely like to be able to do the right thing with unicode print and reject statements. The doc just needs to revise the character encoding discussion in the reference manual. And, of course, I'll write a hello emoji blog post to which everyone will ask if we can use unicode identifiers.

@WardBrian
Copy link
Member Author

If you want a brief, non-technical answer to that request: This PR is so simple because it only ensures we escape non-ascii characters in a way C++ understands, thereby enabling the user to use any encoding that their editor and terminal support. For us to use said characters outside of strings, we would need to actually care about the encoding in a hands-on way.

@bob-carpenter
Copy link
Member

I think we'll just take any old byte stream in the input and preserve it. If it happens to correspond to UTF-8 characters and you have a console, etc. that will render that, then great.

@WardBrian
Copy link
Member Author

I've updated the doc PR to more or less say that and avoid the question of 'what is a character'

@WardBrian WardBrian requested a review from SteveBronder August 31, 2021 14:57
@WardBrian
Copy link
Member Author

Docs have been merged so this should be good to go

Copy link
Collaborator

@nhuurre nhuurre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but could you add a test file test/integration/good/code-gen/print_unicode.stan with a couple of examples.

@WardBrian
Copy link
Member Author

test file test/integration/good/code-gen/print_unicode.stan

Done! Here is what I added:

transformed data {
  print("test: Љ😃");
  print("λ β ζ π");
}

(file encoded in UTF-8)

and the expected output segment:

      current_statement__ = 1;
      if (pstream__) {
        stan_print(pstream__, "test: \320\211\360\237\230\203");
        stan_print(pstream__, "\n");
      }
      current_statement__ = 2;
      if (pstream__) {
        stan_print(pstream__, "\316\273 \316\262 \316\266 \317\200");
        stan_print(pstream__, "\n");
      }

You can check that those are the correct escape sequences, and running the program prints as expected on my UTF-8 supporting terminal

@WardBrian WardBrian merged commit 23b7151 into stan-dev:master Sep 10, 2021
@WardBrian WardBrian deleted the escape-unicode branch September 10, 2021 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants