Skip to content

ICU-23004 C++ Unicode string code point iterators #3096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 126 commits into from
Apr 11, 2025

Conversation

markusicu
Copy link
Member

@markusicu markusicu commented Aug 12, 2024

New C++ header-only APIs for iterating over the Unicode code points in a Unicode string, and more generally over the code units from a code unit iterator. These are modern C++ equivalents of some of the long-standing C macros for iterating over UTF-8 and UTF-16. This C++ API also supports UTF-32.

FYI: UTF-8 and UTF-16 encode code points with variable-length code unit sequences. A validating iterator needs to read and check all of the code units for one code point. When a code unit sequence is ill-formed, then the returned subsequence must be a prefix of a well-formed sequence. (Except we always return at least one code unit, so that we always progress.) (UTF-32 still has validation, but sequences always have length one.)

The API can read code units from a C++ input_iterator or forward_iterator or bidirectional_iterator. The latter includes code unit pointers like const char * and const char16_t *. There is a convenience API for std::string_views.

The main class is called UTFIterator. Its operator*() returns a value serving a variety of use cases: Class CodeUnits provides the code point, the start of its minimal subsequence, the number of code units, and whether they are well-formed. (All functions are declared inline. An optimizing compiler will usually omit fields that are not used, and the code for computing them.)

UTFIterator has the API of a C++ STL iterator. It has template parameters for the code unit iterator type, for the code point type, and for how to handle ill-formed subsequences. std::make_reverse_iterator works for making reverse-range iterators.

The convenience class UTFStringCodePoints turns a std::string_view (of variable code unit type) into a code point iteration “range” with begin()/end()/rbegin()/rend() functions.

There are convenience functions utfIterator() and utfStringCodePoints() to simplify call sites; they deduce the code unit and base iterator types.

For each of these classes and convenience functions, there is also an “unsafe” version, just like for the C macros. The normal versions validate the code unit sequences. The “unsafe” ones assume/require that the strings/sequences are well-formed. As a result, they yield much smaller and faster code.

Checklist
  • Rename utfiter.h to utfiterator.h as proposed, unless the TC decides on a different name.
  • Discuss API with TC, make changes as requested.
  • Finish API docs.
  • More comprehensive testing / API & implementation coverage.
  • Squash...
  • Required: Issue filed: ICU-23004
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

ALLOW_MANY_COMMITS=true

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/utf16cppiter.h is different
  • icu4c/source/common/unicode/uversion.h is no longer changed in the branch
  • icu4c/source/test/intltest/intltest.vcxproj is different
  • icu4c/source/test/intltest/intltest.vcxproj.filters is different
  • icu4c/source/test/intltest/itutil.cpp is different
  • icu4c/source/test/intltest/Makefile.in is different
  • icu4c/source/test/intltest/utfcppitertest.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@markusicu
Copy link
Member Author

Hi @eggrobin I think this is worth taking another look. I rebased on recent main, made changes from our discussions, and I think this looks roughly like a reasonable validating, forward-only (so far) Unicode 16-bit-string code point iterator. It no longer tries to be clever: It no longer reads & validates the code point while iterating, and no longer stores the result in the iterator.

Plenty of TODOs and questions left, but I would appreciate feedback on the shape of what I've got so far.

@markusicu
Copy link
Member Author

I experimented with godbolt, and found that the compiler does its best fusing operator*() and operator++() when they both call the same implementation function. This makes operator++() look horribly inefficient, but the machine code for a regular range-based for loop from the optimizing clang 19 looks very concise.

I then also made the iterator bidirectional and added a special version for efficient rbegin() & rend() using the same principles.

The bidirectional iterator also exposes explicit but non-colloquial functions.

@eggrobin
Copy link
Member

eggrobin commented Jan 3, 2025

(As noted over email, I’ll take a look on Monday when I’m back from the holidays.)

@markusicu markusicu changed the title experiment with UTF-8/16 C++ iterators ICU-23004: experiment with UTF-8/16 C++ iterators Jan 7, 2025
@markusicu markusicu changed the title ICU-23004 experiment with UTF-8/16 C++ iterators ICU-23004 C++ code point iterators over Unicode strings Apr 4, 2025
@markusicu markusicu changed the title ICU-23004 C++ code point iterators over Unicode strings ICU-23004 C++ Unicode string code point iterators Apr 5, 2025
@markusicu
Copy link
Member Author

@eggrobin @richgillam I think I might be done. 🎉

I have added a lot of test code and am out of ideas for what else we need.
Not much has changed in the implementation since our tour.

Both the new header file and the test code still contain the experimental/sample code. I will remove those after you give me a green light. (I will then of course need a re-rubber-stamp.)

Please review again.

You could go commit by commit since our tour... but there are a lot of commits, and some back & forth of adding code and then deduplicating/simplifying.
Probably best to look at the whole delta.
Let me know if you want another tour.

I also intend to make a copy of my branch before squashing. Try to remind me of that...

@markusicu
Copy link
Member Author

@roubert in case you have time: I added you back as a reviewer as well.
The implementation code has not changed a lot since you last looked.
Most of the recent changes are in the test code.
And you are more familiar with modern C++ than Rich.

@markusicu markusicu requested a review from eggrobin April 8, 2025 22:16
@markusicu
Copy link
Member Author

The ICU-TC today approved my API amendment.

  • iterator default constructors
  • switching each of the string-code-points factory functions from tparam StringView to 5 string_view overloads

I would really like to move forward here and get this work merged -- and try to use it for real.
If there is no severe feedback, then I want to remove the experimental & sample code, make a backup, and ask for approval, even if it's a rubber stamp.
I am willing to make changes later for good feedback.
I already have a short list of things to add or try.

richgillam
richgillam previously approved these changes Apr 11, 2025
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest: My eyes started to cross midway through the unit tests. But I think the implementation code looks good (as far as I could tell, anyway).

- remove conditional experimental code
- remove a TODO that we may or may not revisit later (noted elsewhere)
- remove sample code from the test file
- remove commented-out printf()s
@markusicu
Copy link
Member Author

@eggrobin @richgillam I just committed my cleanup. I think it's ready to go, so please re-approve.

As for backup, Robin suggests that I squash-and-merge this PR, temporarily allowing that and being careful to prefix the commit message as required. That will avoid confusion with duplicate PRs while keeping the pre-squash-n-merge commit history in place.

@markusicu markusicu merged commit 40fb3a9 into unicode-org:main Apr 11, 2025
93 of 94 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants