Properly escape string literals with non-ascii characters #952

WardBrian · 2021-08-24T19:23:37Z

Currently, string literals are escaped with %S formatting, which turns non-printing characters into a three-digit escape sequence like \012. However, this escaping is in decimal, while C++ expects an escape sequence like \123 to be in octal. This PR copies most of the OCaml stdlib's escaping code but replaces the relevant section to output octal sequences instead of decimal, so string literals are now passed correctly to the C++ code.
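
To make the decimal-versus-octal mismatch concrete, here is a minimal sketch in C++. It is not the OCaml code from this PR: `escape_octal` and `escape_decimal` are hypothetical names, and the set of characters given short escapes is an assumption chosen for illustration.

```cpp
#include <cstdio>
#include <string>

// Shared helper: escape a byte string for use inside a C++ string literal.
// `fmt` selects how non-printable bytes are written ("\\%03o" or "\\%03u").
static std::string escape_with(const std::string& bytes, const char* fmt) {
  std::string out;
  for (unsigned char c : bytes) {
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n";  break;
      case '\t': out += "\\t";  break;
      case '\r': out += "\\r";  break;
      default:
        if (c >= 0x20 && c < 0x7f) {
          out += static_cast<char>(c);  // printable ASCII passes through
        } else {
          char buf[5];  // backslash + three digits + terminator
          std::snprintf(buf, sizeof buf, fmt, static_cast<unsigned>(c));
          out += buf;
        }
    }
  }
  return out;
}

// Three-digit octal escapes, the form a C++ compiler reads back correctly.
std::string escape_octal(const std::string& bytes) {
  return escape_with(bytes, "\\%03o");
}

// Three-digit decimal escapes, shown only for contrast with the old output.
std::string escape_decimal(const std::string& bytes) {
  return escape_with(bytes, "\\%03u");
}
```

For the two UTF-8 bytes of Љ (0xD0 0x89), the decimal form gives `\208\137`. A C++ compiler reads `\208` as the octal escape `\20` followed by the character `8`, which is why the old output printed strangely. The octal form gives `\320\211`, which decodes back to the original bytes. Always writing three digits also keeps a following digit character from being absorbed into the escape.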

In theory this allows you to write print/reject statements in non-English character sets; in practice I only tested it on some emojis and Cyrillic characters. The existing behavior led to some very strange results being printed, but after this change the output printed as expected (on a UTF-8-supporting terminal, anyway).

This was mostly a joke between @bob-carpenter and me -- he said if I added emojis to Stan, it would get a blog post. But it is a relatively simple change and doesn't require any upkeep/support, because strings are so limited in their use.

Submission Checklist

Run unit tests
Documentation
- If a user-facing change was made, the documentation PR is here: Update encoding notices for string literals docs#388

Release notes

Provide rough support for non-ASCII characters in string literals.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

WardBrian · 2021-08-25T14:33:31Z

I'm not sure what is causing the test failures here. Those models compile for me locally, and in particular none of the existing cpp output is changed by this PR

rok-cesnovar · 2021-08-25T14:41:54Z

That is unrelated to this PR, the error happens on master too. See #944 (comment)

It's been fixed, but it needs a few more hours for the Math changes to propagate up to CmdStan. Sorry about that.

WardBrian · 2021-08-25T14:49:54Z

Ah, I'm just glad to hear it wasn't caused by this - that would have been deeply confusing

SteveBronder · 2021-08-25T18:13:03Z

Indirectly related to this PR: C++ source code is allowed to be Unicode, though I think this would play weirdly with our upstream service API users via R, Python, etc.

https://godbolt.org/z/jK6nq57rc

WardBrian · 2021-08-25T18:28:05Z

I think supporting it in source code would be a much bigger endeavor. Part of the reason this change was so simple is we are actually already correctly lexing/tokenizing these strings, so all I needed to do was make them output in the way C++ expects. Plus, strings are such a minor part of the language that it is ultimately a low-impact change.

If we wanted them as identifiers in the language, I think we would need to use a different lexing library like sedlex, which actually supports Unicode fully.

SteveBronder · 2021-08-25T18:39:52Z

Yeah totally out of the range of this PR, just an interesting thought

bob-carpenter · 2021-08-25T20:12:21Z

The main attraction of Unicode characters in Stan programs would be to allow the following.

β ~ normal(0, σ);

If we allowed that, the Julia devs might stop belittling Stan for only supporting ASCII. 😄 Of course, as soon as I threw down the emoji, I realized what we're likely to get is this:

🙂 ~ ☺️_lpdf(😄);

WardBrian · 2021-08-26T14:42:05Z

@rok-cesnovar - any idea when the stan-math fix will definitely have propagated? I'm still seeing errors on those two models

rok-cesnovar · 2021-08-26T14:46:53Z

Hey, as soon as this run in stan-dev/stan finishes: https://jenkins.mc-stan.org/blue/organizations/jenkins/Stan/detail/develop/932/

It should finish in about an hour (barring anything unexpected). We had a bit of a backlog of tests yesterday so this took a bit longer…

rok-cesnovar · 2021-08-26T17:36:55Z

Finished now, I restarted the tests here.

WardBrian · 2021-08-27T15:58:53Z

Tests seem good. If we actually want this feature I will quickly write up a doc change - any opinions?

bob-carpenter · 2021-08-27T16:06:08Z

Yes, we'd definitely like to be able to do the right thing with unicode print and reject statements. The doc just needs to revise the character encoding discussion in the reference manual. And, of course, I'll write a hello emoji blog post to which everyone will ask if we can use unicode identifiers.

WardBrian · 2021-08-27T16:32:49Z

If you want a brief, non-technical answer to that request: This PR is so simple because it only ensures we escape non-ascii characters in a way C++ understands, thereby enabling the user to use any encoding that their editor and terminal support. For us to use said characters outside of strings, we would need to actually care about the encoding in a hands-on way.

bob-carpenter · 2021-08-27T20:51:36Z

I think we'll just take any old byte stream in the input and preserve it. If it happens to correspond to UTF-8 characters and you have a console, etc. that will render that, then great.

WardBrian · 2021-08-30T14:16:16Z

I've updated the doc PR to more or less say that and avoid the question of 'what is a character'

WardBrian · 2021-08-31T15:41:48Z

Docs have been merged so this should be good to go

nhuurre

Looks good, but could you add a test file test/integration/good/code-gen/print_unicode.stan with a couple of examples?

WardBrian · 2021-09-09T19:53:26Z

test file test/integration/good/code-gen/print_unicode.stan

Done! Here is what I added:

transformed data {
  print("test: Љ😃");
  print("λ β ζ π");
}

(file encoded in UTF-8)

and the expected output segment:

      current_statement__ = 1;
      if (pstream__) {
        stan_print(pstream__, "test: \320\211\360\237\230\203");
        stan_print(pstream__, "\n");
      }
      current_statement__ = 2;
      if (pstream__) {
        stan_print(pstream__, "\316\273 \316\262 \316\266 \317\200");
        stan_print(pstream__, "\n");
      }

You can check that those are the correct escape sequences, and running the program prints as expected on my UTF-8-supporting terminal.
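
One way to check the sequences by hand is to encode each code point as UTF-8 and write every byte in octal. The following is a rough C++ sketch of that check; `utf8_encode` and `octal_escapes` are hypothetical helper names and are not part of stanc3.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Encode one Unicode scalar value (up to U+10FFFF) as UTF-8 bytes.
std::string utf8_encode(std::uint32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Write bytes >= 0x80 as three-digit octal escapes; copy ASCII unchanged.
// (Simplified: quotes, backslashes and control characters are not handled.)
std::string octal_escapes(const std::string& bytes) {
  std::string out;
  for (unsigned char c : bytes) {
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {
      char buf[5];  // backslash + three octal digits + terminator
      std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(c));
      out += buf;
    }
  }
  return out;
}
```

With these helpers, Љ (U+0409) followed by 😃 (U+1F603) comes out as `\320\211\360\237\230\203`, and λ (U+03BB), β (U+03B2), ζ (U+03B6), π (U+03C0) come out as `\316\273`, `\316\262`, `\316\266`, `\317\200`, matching the expected output above.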

Properly escape string literals with non-ascii characters

447d1bd

WardBrian mentioned this pull request Aug 27, 2021

Update encoding notices for string literals stan-dev/docs#388

Merged

2 tasks

WardBrian requested a review from SteveBronder August 31, 2021 14:57

WardBrian requested a review from rok-cesnovar September 8, 2021 16:51

WardBrian mentioned this pull request Sep 9, 2021

Release 2.28 checklist stan-dev/cmdstan#1037

Closed

23 tasks

nhuurre requested changes Sep 9, 2021

Add UTF-8 printing test

13b353d

nhuurre approved these changes Sep 10, 2021

WardBrian added 2 commits September 10, 2021 09:23

Merge branch 'master' of github.com:stan-dev/stanc3 into escape-unicode

e38bf58

Fix test error

76430a2

WardBrian merged commit 23b7151 into stan-dev:master Sep 10, 2021

WardBrian deleted the escape-unicode branch September 10, 2021 14:54

WardBrian mentioned this pull request Feb 15, 2024

Experimental support for unicode identifiers. #1407

Closed

2 tasks

WardBrian mentioned this pull request Mar 10, 2025

Experimental support for unicode identifiers #1499

Draft

2 tasks
