-
-
Notifications
You must be signed in to change notification settings - Fork 787
ICU-23004 C++ Unicode string code point iterators #3096
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2ce8c27
to
1bcd5ee
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Hi @eggrobin I think this is worth taking another look. I rebased on recent main, made changes from our discussions, and I think this looks roughly like a reasonable validating, forward-only (so far) Unicode 16-bit-string code point iterator. It no longer tries to be clever: It no longer reads & validates the code point while iterating, and no longer stores the result in the iterator. Plenty of TODOs and questions left, but I would appreciate feedback on the shape of what I've got so far. |
I experimented with godbolt, and found that the compiler does its best fusing operator*() and operator++() when they both call the same implementation function. This makes operator++() look horribly inefficient, but the machine code for a regular range-based for loop from the optimizing clang 19 looks very concise. I then also made the iterator bidirectional and added a special version for efficient rbegin() & rend() using the same principles. The bidirectional iterator also exposes explicit but non-colloquial functions. |
(As noted over email, I’ll take a look on Monday when I’m back from the holidays.) |
@eggrobin @richgillam I think I might be done. 🎉 I have added a lot of test code and am out of ideas for what else we need. Both the new header file and the test code still contain the experimental/sample code. I will remove those after you give me a green light. (I will then of course need a re-rubber-stamp.) Please review again. You could go commit by commit since our tour... but there are a lot of commits, and some back & forth of adding code and then deduplicating/simplifying. I also intend to make a copy of my branch before squashing. Try to remind me of that... |
@roubert in case you have time: I added you back as a reviewer as well. |
The ICU-TC today approved my API amendment.
I would really like to move forward here and get this work merged -- and try to use it for real. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be honest: My eyes started to cross midway through the unit tests. But I think the implementation code looks good (as far as I could tell, anyway).
- remove conditional experimental code - remove a TODO that we may or may not revisit later (noted elsewhere) - remove sample code from the test file - remove commented-out printf()s
@eggrobin @richgillam I just committed my cleanup. I think it's ready to go, so please re-approve. As for backup, Robin suggests that I squash-and-merge this PR, temporarily allowing that and being careful to prefix the commit message as required. That will avoid confusion with duplicate PRs while keeping the pre-squash-n-merge commit history in place. |
New C++ header-only APIs for iterating over the Unicode code points in a Unicode string, and more generally over the code units from a code unit iterator. These are modern C++ equivalents of some of the long-standing C macros for iterating over UTF-8 and UTF-16. This C++ API also supports UTF-32.
FYI: UTF-8 and UTF-16 encode code points with variable-length code unit sequences. A validating iterator needs to read and check all of the code units for one code point. When a code unit sequence is ill-formed, then the returned subsequence must be a prefix of a well-formed sequence. (Except we always return at least one code unit, so that we always progress.) (UTF-32 still has validation, but sequences always have length one.)
The API can read code units from a C++ input_iterator or forward_iterator or bidirectional_iterator. The latter includes code unit pointers like
const char *
andconst char16_t *
. There is a convenience API forstd::string_view
s.The main class is called
UTFIterator
. Itsoperator*()
returns a value serving a variety of use cases: ClassCodeUnits
provides the code point, the start of its minimal subsequence, the number of code units, and whether they are well-formed. (All functions are declared inline. An optimizing compiler will usually omit fields that are not used, and the code for computing them.)UTFIterator
has the API of a C++ STL iterator. It has template parameters for the code unit iterator type, for the code point type, and for how to handle ill-formed subsequences.std::make_reverse_iterator
works for making reverse-range iterators.The convenience class
UTFStringCodePoints
turns astd::string_view
(of variable code unit type) into a code point iteration “range” with begin()/end()/rbegin()/rend() functions.There are convenience functions
utfIterator()
andutfStringCodePoints()
to simplify call sites; they deduce the code unit and base iterator types.For each of these classes and convenience functions, there is also an “unsafe” version, just like for the C macros. The normal versions validate the code unit sequences. The “unsafe” ones assume/require that the strings/sequences are well-formed. As a result, they yield much smaller and faster code.
Checklist
ALLOW_MANY_COMMITS=true