ICU-23004 C++ Unicode string code point iterators #3096

markusicu · 2024-08-12T22:01:33Z

New C++ header-only APIs for iterating over the Unicode code points in a Unicode string, and more generally over the code units from a code unit iterator. These are modern C++ equivalents of some of the long-standing C macros for iterating over UTF-8 and UTF-16. This C++ API also supports UTF-32.

FYI: UTF-8 and UTF-16 encode code points with variable-length code unit sequences. A validating iterator needs to read and check all of the code units for one code point. When a code unit sequence is ill-formed, then the returned subsequence must be a prefix of a well-formed sequence. (Except we always return at least one code unit, so that we always progress.) (UTF-32 still has validation, but sequences always have length one.)
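To make the validation rule concrete, here is a minimal sketch for UTF-16 only. It is not the code from this PR: `DecodedUnits` and `decodeOne` are hypothetical names, and substituting U+FFFD for ill-formed input is just one possible behavior. It shows that an unpaired surrogate still consumes exactly one code unit, so iteration always makes progress.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch; not the CodeUnits class from the PR.
struct DecodedUnits {
    char32_t codePoint;   // U+FFFD when the sequence is ill-formed (an assumption)
    std::size_t length;   // number of code units consumed, always >= 1
    bool wellFormed;
};

// Decodes the sequence that starts at s[i]. Requires i < s.size().
inline DecodedUnits decodeOne(std::u16string_view s, std::size_t i) {
    char16_t lead = s[i];
    if (lead < 0xD800 || lead > 0xDFFF) {
        return {lead, 1, true};  // not a surrogate: one unit, one code point
    }
    if (lead <= 0xDBFF && i + 1 < s.size()) {
        char16_t trail = s[i + 1];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            char32_t cp = 0x10000 + ((char32_t(lead) - 0xD800) << 10) +
                          (char32_t(trail) - 0xDC00);
            return {cp, 2, true};  // well-formed surrogate pair
        }
    }
    // Unpaired surrogate: report one unit so that the caller still advances.
    return {0xFFFD, 1, false};
}

// Small usage example.
inline void decodeOneExamples() {
    assert(decodeOne(u"a", 0).length == 1);
    assert(decodeOne(u"\U0001F600", 0).codePoint == 0x1F600);
}
```

A caller would advance its index by `length` after each call, whether or not the sequence was well-formed.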

The API can read code units from a C++ input_iterator, forward_iterator, or bidirectional_iterator. Bidirectional iterators include code unit pointers like const char * and const char16_t *. There is a convenience API for std::string_views.

The main class is called UTFIterator. Its operator*() returns a CodeUnits object that serves a variety of use cases: it provides the code point, the start of its minimal subsequence, the number of code units, and whether they are well-formed. (All functions are declared inline. An optimizing compiler will usually omit fields that are not used, and the code for computing them.)

UTFIterator has the API of a C++ STL iterator. It has template parameters for the code unit iterator type, for the code point type, and for how to handle ill-formed subsequences. std::make_reverse_iterator works for making reverse-range iterators.

The convenience class UTFStringCodePoints turns a std::string_view (of variable code unit type) into a code point iteration “range” with begin()/end()/rbegin()/rend() functions.

There are convenience functions utfIterator() and utfStringCodePoints() to simplify call sites; they deduce the code unit and base iterator types.
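The general shape described above (a range with begin()/end() whose iterator yields a small result object per code point) can be illustrated with a self-contained sketch. It deliberately does not use the new header: `SketchUnits`, `SketchCodePoints`, and `collectCodePoints` are hypothetical stand-ins, limited to UTF-16 and forward iteration, and the real classes have more template parameters and capabilities.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the per-code-point result object.
struct SketchUnits {
    char32_t codePoint;        // U+FFFD for ill-formed input (an assumption)
    std::u16string_view units; // the code units that were read
    bool wellFormed;
};

// Hypothetical stand-in for a code point "range" over a UTF-16 string view.
class SketchCodePoints {
public:
    class Iterator {
    public:
        Iterator(std::u16string_view s, std::size_t pos) : s_(s), pos_(pos) {}
        SketchUnits operator*() const { return read(); }
        Iterator &operator++() { pos_ += read().units.size(); return *this; }
        bool operator==(const Iterator &other) const { return pos_ == other.pos_; }
        bool operator!=(const Iterator &other) const { return pos_ != other.pos_; }
    private:
        // Validates and decodes the sequence at pos_ without moving.
        SketchUnits read() const {
            char16_t lead = s_[pos_];
            if (lead < 0xD800 || lead > 0xDFFF) {
                return {lead, s_.substr(pos_, 1), true};
            }
            if (lead <= 0xDBFF && pos_ + 1 < s_.size()) {
                char16_t trail = s_[pos_ + 1];
                if (trail >= 0xDC00 && trail <= 0xDFFF) {
                    char32_t cp = 0x10000 + ((char32_t(lead) - 0xD800) << 10) +
                                  (char32_t(trail) - 0xDC00);
                    return {cp, s_.substr(pos_, 2), true};
                }
            }
            return {0xFFFD, s_.substr(pos_, 1), false};
        }
        std::u16string_view s_;
        std::size_t pos_;
    };

    explicit SketchCodePoints(std::u16string_view s) : s_(s) {}
    Iterator begin() const { return Iterator(s_, 0); }
    Iterator end() const { return Iterator(s_, s_.size()); }
private:
    std::u16string_view s_;
};

// Usage: a range-based for loop over code points.
inline std::vector<char32_t> collectCodePoints(std::u16string_view s) {
    std::vector<char32_t> out;
    for (SketchUnits units : SketchCodePoints(s)) {
        out.push_back(units.codePoint);
    }
    assert(out.size() <= s.size());  // never more code points than code units
    return out;
}
```

Returning the result object by value from operator*() lets a caller pick whichever fields it needs, which matches the use-case-driven design described for CodeUnits.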

For each of these classes and convenience functions, there is also an “unsafe” version, just like for the C macros. The normal versions validate the code unit sequences. The “unsafe” ones assume/require that the strings/sequences are well-formed. As a result, they yield much smaller and faster code.
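A rough illustration of why the "unsafe" variants can be smaller: when the input is known to be well-formed, the decoder needs neither a limit check nor a trail surrogate check. The function below is a hypothetical sketch for UTF-16 pointers (`unsafeNext` is not a name from this PR), written in the spirit of the unsafe C macros mentioned above. Calling it on ill-formed or truncated input is undefined behavior.

```cpp
#include <cassert>

// Hypothetical sketch: reads one code point and advances p.
// Precondition: p points into well-formed UTF-16 with at least one
// complete sequence remaining. Nothing is validated.
inline char32_t unsafeNext(const char16_t *&p) {
    char32_t c = *p++;
    if ((c & 0xFC00) == 0xD800) {
        // Lead surrogate: trust that a trail surrogate follows.
        c = 0x10000 + ((c - 0xD800) << 10) + (*p++ - 0xDC00);
    }
    return c;
}

// Small usage example.
inline void unsafeNextExample() {
    const char16_t text[] = u"a\U0001F600";
    const char16_t *p = text;
    assert(unsafeNext(p) == U'a');
    assert(unsafeNext(p) == 0x1F600);
    assert(p == text + 3);  // one unit for 'a', two for the surrogate pair
}
```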

Checklist

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2024-12-23T22:26:35Z

Notice: the branch changed across the force-push!

icu4c/source/common/unicode/utf16cppiter.h is different
icu4c/source/common/unicode/uversion.h is no longer changed in the branch
icu4c/source/test/intltest/intltest.vcxproj is different
icu4c/source/test/intltest/intltest.vcxproj.filters is different
icu4c/source/test/intltest/itutil.cpp is different
icu4c/source/test/intltest/Makefile.in is different
icu4c/source/test/intltest/utfcppitertest.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

markusicu · 2024-12-26T20:50:15Z

Hi @eggrobin I think this is worth taking another look. I rebased on recent main, made changes from our discussions, and I think this looks roughly like a reasonable validating, forward-only (so far) Unicode 16-bit-string code point iterator. It no longer tries to be clever: It no longer reads & validates the code point while iterating, and no longer stores the result in the iterator.

Plenty of TODOs and questions left, but I would appreciate feedback on the shape of what I've got so far.

markusicu · 2025-01-03T00:00:04Z

I experimented with godbolt, and found that the compiler does its best fusing operator*() and operator++() when they both call the same implementation function. This makes operator++() look horribly inefficient, but the machine code for a regular range-based for loop from the optimizing clang 19 looks very concise.
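The pattern of routing both operators through one implementation function might look roughly like the sketch below. It is a hypothetical reduction (`FusedIter` and `sumCodePoints` are made-up names, UTF-16 pointers only, code point result only), and it makes no claim about what any particular compiler generates for it.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical sketch: operator*() and operator++() share readAndInc().
class FusedIter {
public:
    FusedIter(const char16_t *p, const char16_t *limit) : p_(p), limit_(limit) {}

    // Reads from a copy of the position, so the iterator itself does not move.
    char32_t operator*() const {
        const char16_t *p = p_;
        return readAndInc(p);
    }
    // Decodes again only to find out how far to advance.
    FusedIter &operator++() {
        readAndInc(p_);
        return *this;
    }
    bool operator==(const FusedIter &other) const { return p_ == other.p_; }
    bool operator!=(const FusedIter &other) const { return p_ != other.p_; }

private:
    // Validating read of one code point; advances p. Requires p != limit_.
    char32_t readAndInc(const char16_t *&p) const {
        char32_t c = *p++;
        if (c < 0xD800 || c > 0xDFFF) { return c; }
        if (c <= 0xDBFF && p != limit_ && (*p & 0xFC00) == 0xDC00) {
            return 0x10000 + ((c - 0xD800) << 10) + (*p++ - 0xDC00);
        }
        return 0xFFFD;  // unpaired surrogate (substitution is an assumption)
    }
    const char16_t *p_;
    const char16_t *limit_;
};

// A loop that uses both operators once per code point.
inline std::uint32_t sumCodePoints(std::u16string_view s) {
    const char16_t *limit = s.data() + s.size();
    std::uint32_t sum = 0;
    for (FusedIter it(s.data(), limit), end(limit, limit); it != end; ++it) {
        sum += *it;
    }
    return sum;
}

inline void sumCodePointsExample() {
    assert(sumCodePoints(u"ab") == 0x61 + 0x62);
}
```

In source form operator++() appears to do redundant decoding work. The observation above is that, once both calls are inlined into the same loop body, the optimizer has a good chance of merging them.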

I then also made the iterator bidirectional and added a special version for efficient rbegin() & rend() using the same principles.

The bidirectional iterator also exposes explicit but non-colloquial functions.
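For readers unfamiliar with backward decoding, here is a minimal sketch of the step that a bidirectional code point iterator has to perform when it is decremented. It covers validated UTF-16 only. `prevCodePoint` is a hypothetical helper, not a function from this PR, and it does not show the specialized rbegin()/rend() version.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sketch: reads the code point that ends just before s[i]
// and moves i back to its first code unit. Requires 0 < i <= s.size().
inline char32_t prevCodePoint(std::u16string_view s, std::size_t &i) {
    char32_t c = s[--i];
    if (c < 0xD800 || c > 0xDFFF) { return c; }  // not a surrogate
    if (c >= 0xDC00 && i > 0) {
        char32_t lead = s[i - 1];
        if (lead >= 0xD800 && lead <= 0xDBFF) {
            --i;  // step over the lead surrogate as well
            return 0x10000 + ((lead - 0xD800) << 10) + (c - 0xDC00);
        }
    }
    return 0xFFFD;  // unpaired surrogate: still moves back by one unit
}

// Small usage example: walk a string from the end to the start.
inline void prevCodePointExample() {
    std::u16string_view s = u"a\U0001F600";
    std::size_t i = s.size();
    assert(prevCodePoint(s, i) == 0x1F600);
    assert(i == 1);
    assert(prevCodePoint(s, i) == U'a');
    assert(i == 0);
}
```

The backward step mirrors the forward one: it starts from a possible trail surrogate and looks for a preceding lead, and an unpaired surrogate again counts as a single code unit.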

eggrobin · 2025-01-03T00:01:36Z

(As noted over email, I’ll take a look on Monday when I’m back from the holidays.)

markusicu · 2025-04-05T03:28:05Z

@eggrobin @richgillam I think I might be done. 🎉

I have added a lot of test code and am out of ideas for what else we need.
Not much has changed in the implementation since our tour.

Both the new header file and the test code still contain the experimental/sample code. I will remove those after you give me a green light. (I will then of course need a re-rubber-stamp.)

Please review again.

You could go commit by commit since our tour... but there are a lot of commits, and some back & forth of adding code and then deduplicating/simplifying.
Probably best to look at the whole delta.
Let me know if you want another tour.

I also intend to make a copy of my branch before squashing. Try to remind me of that...

markusicu · 2025-04-07T21:48:36Z

@roubert in case you have time: I added you back as a reviewer as well.
The implementation code has not changed a lot since you last looked.
Most of the recent changes are in the test code.
And you are more familiar with modern C++ than Rich.

markusicu · 2025-04-10T17:46:17Z

The ICU-TC today approved my API amendment:

- iterator default constructors
- switching each of the string-code-points factory functions from tparam StringView to 5 string_view overloads

I would really like to move forward here and get this work merged -- and try to use it for real.
If there is no severe feedback, then I want to remove the experimental & sample code, make a backup, and ask for approval, even if it's a rubber stamp.
I am willing to make changes later for good feedback.
I already have a short list of things to add or try.

richgillam

I'll be honest: My eyes started to cross midway through the unit tests. But I think the implementation code looks good (as far as I could tell, anyway).

markusicu · 2025-04-11T18:29:49Z

@eggrobin @richgillam I just committed my cleanup. I think it's ready to go, so please re-approve.

As for backup, Robin suggests that I squash-and-merge this PR, temporarily allowing that and being careful to prefix the commit message as required. That will avoid confusion with duplicate PRs while keeping the pre-squash-n-merge commit history in place.

markusicu assigned eggrobin Aug 12, 2024

markusicu added 3 commits December 23, 2024 14:14

U16Iterator experiment

74e9b6f

U16Iterator op*() returns U16OneSeq

6568b04

header-only

1bcd5ee

markusicu force-pushed the utfcppiter branch from 2ce8c27 to 1bcd5ee Compare December 23, 2024 22:26

markusicu added 5 commits December 23, 2024 16:55

operator* read on the fly

20f890b

fix hdrtest

7dc31d2

U16IllFormedBehavior

b381c2b

C++ range: U16StringCodePoints

6851e8d

template param: code point type

64ea110

markusicu added 3 commits December 26, 2024 17:19

make it work outside of ICU

7bbeefc

experimental sample code

43e99e0

pre=post-inc, fused readAndInc()

bfc722e

Shakattack76 approved these changes Dec 27, 2024

markusicu added 3 commits January 2, 2025 13:59

readAndInc() for all

c156434

bidirectional

e0cf8f7

efficient rbegin() & rend()

ca4787e

markusicu added 8 commits January 2, 2025 16:11

doxygen tparam

a24b710

remove non-standard iter API

633fafa

C enum UIllFormedBehavior will be shared with 8-bit

70ef2fa

CodeUnits result will be shared with 8-bit

da93999

CodeUnits: getters / private fields

5c6e1a6

unsafe=well-formed iterators

84dc5f4

restore base dec() (oops)

8bea75e

rename to utfiter.h, also test

5281d61

markusicu changed the title ~~experiment with UTF-8/16 C++ iterators~~ ICU-23004: experiment with UTF-8/16 C++ iterators Jan 7, 2025

markusicu added 11 commits April 1, 2025 16:48

ICU-23004 test unsafe bidi iter

255969a

ICU-23004 start implementation code coverage tests

916466c

ICU-23004 testSafeLongLinear(bidi, bad)

e1b8bd6

ICU-23004 test fn impl: more deduction, simpler call sites

3ec9f03

ICU-23004 clearer+shorter with dont-care test constants

b03914f

ICU-23004 fix utfStringCodePoints(): StringView -> 5 string_view over…

6fb4eca

…loads

ICU-23004 shared testSinglePassIter() impl

e127460

ICU-23004 testLongLinearInput()

c1e5e92

ICU-23004 testLongLinearFwd()

b360c07

ICU-23004 testLongBackward()

ebfff4e

ICU-23004 testLongReverse()

c1fae66

markusicu changed the title ~~ICU-23004 experiment with UTF-8/16 C++ iterators~~ ICU-23004 C++ code point iterators over Unicode strings Apr 4, 2025

markusicu added 2 commits April 4, 2025 15:08

ICU-23004 couple more bad UTF-8 cases

2f11923

ICU-23004 test iterator state: zigzag

a02226b

markusicu changed the title ~~ICU-23004 C++ code point iterators over Unicode strings~~ ICU-23004 C++ Unicode string code point iterators Apr 5, 2025

markusicu requested review from richgillam, eggrobin and roubert April 7, 2025 21:46

eggrobin reviewed Apr 8, 2025

icu4c/source/common/unicode/utfiterator.h (outdated, resolved)

markusicu requested a review from eggrobin April 8, 2025 22:16

richgillam previously approved these changes Apr 11, 2025

eggrobin reviewed Apr 11, 2025

icu4c/source/test/intltest/utfiteratortest.cpp (outdated, resolved)

ICU-23004 cleanup

85abfec

- remove conditional experimental code
- remove a TODO that we may or may not revisit later (noted elsewhere)
- remove sample code from the test file
- remove commented-out printf()s

markusicu dismissed richgillam’s stale review via 85abfec April 11, 2025 18:27

eggrobin approved these changes Apr 11, 2025

markusicu merged commit 40fb3a9 into unicode-org:main Apr 11, 2025
93 of 94 checks passed
