
[stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes #3988

Open · wants to merge 7 commits into base: main
Conversation

martinvuyk (Contributor)

This is a proposal on the main goals, followed by a proposed approach to get there.

I would like us to reach a consensus on this topic, given that it will affect many projects and involve a lot of work to fully support.

Everyone I can think of who is involved with strings or who should be part of this conversation: @JoeLoser @ConnorGray @lsh @jackos @mzaks @bgreni @thatstoasty @leb-kuchen

@martinvuyk changed the title from "String, ASCII, Unicode, UTF, Graphemes" to "[stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes" on Feb 5, 2025
@gryznar (Contributor) commented Feb 5, 2025

I like this proposal :)


#### Hold off on developing Char further and remove it from stdlib.builtin

`Char` is currently expensive to create and use compared to a `StringSlice`
@gabrieldemarmiesse (Contributor) commented Feb 9, 2025

Do we have numbers on this? For the creation, but also for using the different methods on it. Let's avoid optimizing things without data.

@martinvuyk (Contributor, Author)

> Do we have numbers on this?

I don't need numbers to know that the bit-shifting, masking, and for-loops involved in decoding UTF-32 from UTF-8 are expensive compared to a pointer and a length, which is what a Span is and what StringSlice uses underneath.

> also using different methods on it

Comparing a 16-byte SIMD vector is going to be more expensive* than using count-leading-zeros (most CPUs have a specialized circuit for it) and a bitwise OR (one micro-op) together with a comparison against the ASCII maximum (see #3896).

*: In the context in which this function is used, the number of bytes in a sequence is a prerequisite for a lot of follow-up code, so the throughput advantage of SIMD is not realized because its latency stalls the pipeline. I have done benchmarking and found such cases in #3697 and #3528.
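For readers who want the concrete trick: a minimal sketch of the count-leading-zeros approach (the helper name is mine, not an existing stdlib function; it assumes the `bit.count_leading_zeros` builtin behaves as documented):

```mojo
from bit import count_leading_zeros


fn utf8_sequence_length(leading_byte: UInt8) -> Int:
    # ASCII bytes (0xxxxxxx) start a 1-byte sequence.
    if leading_byte < 0x80:
        return 1
    # For multi-byte sequences, the number of leading one-bits in the first
    # byte (2, 3, or 4) equals the total byte length of the sequence;
    # counting the leading zeros of the inverted byte yields exactly that.
    return Int(count_leading_zeros(~leading_byte))
```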

> Let's avoid optimizing things without data.

A pointer and a length are always going to be less expensive than transforming the data when an algorithmic throughput difference is not part of the equation. That could be the case, for example, when transforming to another mathematical domain to avoid solving differential equations, but it is not the case here IMO.

@ConnorGray (Collaborator)

Hi Martin, thanks for taking the time to write up a proposal on this topic! 🙂

Apologies for the length of this response—"if I'd had more time, I would have written a shorter letter" and all that😌

Before responding to your proposal, let me share a bit about where my head is at regarding how we handle these string-processing APIs. For some context on my recent work in this area: we had a discussion internally on the stdlib team and came to a tentative consensus to move forward with the name Char / char_* for types that operate on Unicode codepoints, with the understanding that we could rename if needed for clarity. It sounds like the current name may not be ideal, so we're open to renaming it 🙂

My current thinking is a modification to what @owenhilyard proposed in Discord here:

> I'm somewhat torn. For most people, they seem to want to iterate over symbols in text, graphemes. However, that can have a substantial perf cost. Char also has potential for confusion with c_char, since c_char is usually a byte. What if we kept the name of the functions as .characters() but make the function return a Grapheme iterator, but also provided .codepoints() and .bytes()? I personally prefer to make the API most people will reach for the most correct in terms of what most people want (iteration over symbols they can see), and then if you have perf issues you can use codepoints or bytes directly.

More directly, I'm currently thinking we should do the following:

- Rename Char to Codepoint for minimal ambiguity
- Keep StringSlice.as_bytes() -> Span[Byte]
- Rename .chars() to StringSlice.codepoints() -> CodepointsIter (iterator over Codepoint)
- Rename .char_slices() to StringSlice.codepoint_slices() (iterator over single-codepoint StringSlice pointers)
- Eventually introduce a grapheme cluster iterator called .characters() or .graphemes(), along with Character (owning) and CharacterSlice (view) types.
  - (If we're feeling particularly bold, perhaps we'd call those Grapheme and GraphemeSlice—but that might be too unfamiliar to folks 🙃.)

Then, eventually, re-add a StringSlice.__iter__() that either calls .codepoint_slices() or .graphemes(), pending discussion about the merits of each of those (performance vs correctness tradeoff, essentially).
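To make the proposed surface concrete, here is a sketch of how a call site might read under these renames (all names are the proposed ones or hypothetical placeholders; none of this exists in the stdlib today):

```mojo
var s = StringSlice("héllo")

for cp in s.codepoints():        # proposed rename of .chars(); yields Codepoint
    print(cp.to_u32())           # hypothetical accessor for the Unicode scalar value

for cs in s.codepoint_slices():  # proposed rename of .char_slices()
    print(cs)                    # single-codepoint StringSlice views

for g in s.graphemes():          # eventual grapheme-cluster iterator
    print(g)                     # one user-perceived character per step
```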


Now some initial thoughts on your proposal 🙂

> They [Swift] have a Character that I think inspired our current Char type; it is a generic representation of a character in any encoding, and can be anything from one codepoint up to a grapheme cluster.

Minor clarification: this is correct regarding Swift's Character containing one or more clustered codepoints, but the Char type I added was actually inspired by Rust's char type, which is an unsigned 32-bit integer. I hadn't intended for our Char to eventually support storing an unbounded number of codepoints, because that would have meant it would need to allocate, which wasn't optimal for the common cases where iterating over codepoints is sufficient.

To that end, that's partly why I'm in favor of renaming our Char type to Codepoint. Char has the advantage of being "recognizable" as an English-ish word, but I've come around to the opinion that we're better off using API naming that more closely adheres to the underlying model and inherent complexity of UTF-8 and Unicode, instead of trying to mask that complexity by pretending that a "Character" and a Unicode scalar value are equivalent.

Using terminology like Codepoint might be unfamiliar at first to folks not already knowledgeable about how computers represent and process text. But I think part of Mojo's broader language philosophy is not shying away from and trying to paper over complexity. Instead, I think it's better to both teach and make that complexity feel manageable with APIs that model it well 🙂

Value vs. Reference

> Our current Char type uses a u32 as storage; every time an iterator that yields Char is used, an instance is parsed from the internal UTF-8 encoded StringSlice (into UTF-32).
>
> The default iterator for String returns a StringSlice, which is a view into the character in the UTF-8 encoded StringSlice. This is much more efficient and does not add any complexity to the type system nor to developer headspace.

I think both types of iteration (over decoded u32 codepoints, and over single-codepoint StringSlice views) are useful, and supporting both is valuable. I agree that iteration over single-character StringSlice is likely the most common for any kind of general-purpose string processing. But for certain specialized use cases, e.g. within parsers, low-level string manipulation algorithms, character encoding converters, etc., being able to compare codepoint values directly can be the most natural way to express the desired logic.

As a simple example, our Unicode case conversion logic maps a single Char codepoint to a sequence of replacement Char that are the uppercase expression of the original Char. That is an algorithm most naturally expressed in terms of single codepoint values, because the Unicode casing tables powering it are based on codepoints, not UTF-8 sequences.
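To illustrate why that algorithm reads most naturally over codepoint values, here is a toy sketch (not the stdlib implementation; the real logic consults the full Unicode casing tables and may expand one codepoint into several):

```mojo
fn to_upper_toy(cp: UInt32) -> UInt32:
    # ASCII-only fast path: 'a'..'z' map to 'A'..'Z' by subtracting 32.
    # The actual casing code does the same kind of lookup, but against
    # Unicode tables keyed by codepoint, not by UTF-8 byte sequences.
    if cp >= 0x61 and cp <= 0x7A:
        return cp - 32
    return cp
```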

So insofar as your argument is that Char should not be used frequently (and even then only by advanced users), I think this is another argument in favor of renaming Char to Codepoint, to de-emphasize it for use by less knowledgeable users.


Regarding the general thrust of the rest of the proposal:

1. I like the idea of having a dedicated ASCIIString type that would enable us to perform encoding-specific optimization 😄
   - I think it would be reasonable to start with a struct ASCIIString that is its own dedicated type, and then work later to assess the performance and ergonomics tradeoff in parameterizing StringSlice on an encoding parameter.
2. I'm less certain that having a parameter for controlling indexing behavior is optimal.
3. I'm unclear on the semantics of the proposed iter(data).chars[encoding.UTF8]() method — could you elaborate on that section?
   - My understanding is that the StringSlice would have some specific encoding it's stored in memory as, so I'm uncertain what "casting" that encoding during iteration would do.

Re. (2): One idea we've discussed a bit internally (but haven't had time to implement yet) is using named arguments within __getitem__ methods, to enable a syntax like:

```mojo
var str = StringSlice("abcef")

var bytes: Span[Byte] = str[byte=2:5]
var cps: StringSlice = str[codepoints=1:3]
var chars: StringSlice = str[chars=1:5]  # Grapheme clusters
```

I still have some reservations about this, but I like that it strikes a balance: a relatively terse syntax for indexing that stays absolutely clear and readable about which indexing semantics are being used (helping the programmer stay aware of behavior and potential performance concerns).
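As a point of comparison, the same intent can be spelled with explicit methods under today's syntax, at the cost of losing slice notation (the method names below are hypothetical, purely for illustration):

```mojo
var s = StringSlice("abcef")

var bytes = s.byte_slice(2, 5)       # hypothetical: slice by byte offsets
var cps = s.codepoint_slice(1, 3)    # hypothetical: slice by codepoint positions
var chars = s.grapheme_slice(1, 5)   # hypothetical: slice by grapheme clusters
```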

I'm unsure, though, what the default for str[1:5] should be: codepoints or grapheme clusters.


Apologies again for the long and scattered nature of these comments; I'm erring on the side of leaving any feedback at all, instead of waiting to write something more polished. Thank you again Martin for helping to articulate a possible design direction here 🙂 — like you, I'm excited about the possibilities of using Mojo's 🔥 features to provide flexible and powerful string handling capabilities to our users.

@martinvuyk (Contributor, Author)

Hi @ConnorGray

> Apologies for the length of this response—"if I'd had more time, I would have written a shorter letter" and all that😌

Thanks for writing; I actually like reading through well-thought-out ideas. Your "unordered thoughts" match some of my essays lol.

> More directly, I'm currently thinking we should do the following:
>
> - Rename Char to Codepoint for minimal ambiguity
> - Keep StringSlice.as_bytes() -> Span[Byte]
> - Rename .chars() to StringSlice.codepoints() -> CodepointsIter (iterator over Codepoint)

+1 on all 3 of these

> - Rename .char_slices() to StringSlice.codepoint_slices() (iterator over single-codepoint StringSlice pointers)

This one I'm not so fond of, because I like the idea of parametrizing more; I'll expand on that below.

> - Eventually introduce a grapheme cluster iterator called .characters() or .graphemes(), along with Character (owning) and CharacterSlice (view) types.
>   - (If we're feeling particularly bold, perhaps we'd call those Grapheme and GraphemeSlice—but that might be too unfamiliar to folks 🙃.)

I think the method should be .graphemes(), because IMO .characters() is more ambiguous. Grapheme and GraphemeSlice won't be necessary: when iterating over a String or StringSlice with Indexing.GRAPHEME, it would just return a StringSlice whose __len__() method might return more than 1.

> Then, eventually, re-add a StringSlice.__iter__() that either calls .codepoint_slices() or .graphemes(), pending discussion about the merits of each of those (performance vs correctness tradeoff, essentially).

I seriously think it is not necessary to deprecate the default iteration over Unicode codepoints, given that it is IMO the sane default (and practically free using StringSlice). It also follows Python's default, which IMO should also count. This is also solved by parametrizing String and StringSlice in a way where we don't have to make that decision for people.


> Using terminology like Codepoint might be unfamiliar at first to folks not already knowledgeable about how computers represent and process text. But I think part of Mojo's broader language philosophy is not shying away from and trying to paper over complexity. Instead, I think it's better to both teach and make that complexity feel manageable with APIs that model it well 🙂

100% Agree on this

> I think both types of iteration (over decoded u32 codepoints, and over single-codepoint StringSlice views) are useful, and supporting both is valuable.

Agreed on this as well, but I just think that the default should be what is used most often. I also think it's quite straightforward to do:

a = "123"
for item in a:
    c = Codepoint(item)
    # or for what I've read them being used for very often
    # in our unicode casing code and golang's stdlib
    # some people call them "runes"
    utf32_value, utf8_length = Codepoint.parse_utf32_utf8_length(item)
    ...

> - I think it would be reasonable to start with a struct ASCIIString that is its own dedicated type, and then work later to assess the performance and ergonomics tradeoff in parameterizing StringSlice on an encoding parameter.

IMHO this will be much more work than parametrizing, adding constrained() on every branch that is not UTF-8, and progressively adding optimizations, because it means duplicating every API and its docstrings, since we don't have good inheritance-like mechanisms yet.

> 2. I'm less certain that having a parameter for controlling indexing behavior is optimal.

Quite the opposite: the optimization trickles down to all users and libraries that interact with the generic String. The indexing is very easy to change with a rebind, or with some APIs to go back and forth if we add them; ascii(String("123")) would return a rebound string, which is free. Any function that does any sort of string manipulation and whose signature accepts a generic String would benefit from the perf gains of e.g. changing to an ASCIIString for your specific use case, since you know only those sequences will be there. Likewise, the performance hit of full grapheme support will only happen where needed (determined by the end user, the programmer, not us).
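A very rough sketch of what that parametrization could look like; the struct, the encoding tags, and the method below are stand-ins I made up to illustrate how an ASCII fast path would be selected at compile time:

```mojo
# Hypothetical encoding tags; an enum-like type would be nicer, but plain
# aliases keep the sketch short.
alias ASCII = 0
alias UTF8 = 1


struct GenericString[encoding: Int = UTF8]:
    var _buffer: List[Byte]

    fn char_length(self) -> Int:
        @parameter
        if encoding == ASCII:
            # ASCII fast path: one byte per character, so the length is free.
            return len(self._buffer)
        else:
            # UTF-8 path: count the bytes that are not continuation bytes.
            var count = 0
            for i in range(len(self._buffer)):
                if (self._buffer[i] & 0xC0) != 0x80:
                    count += 1
            return count
```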

> 3. I'm unclear on the semantics of the proposed iter(data).chars[encoding.UTF8]() method — could you elaborate on that section?

That section was proposing, in the case that we wanted it, a Char type which can have any one of the 4 encodings underneath*. If you wanted to iterate over Char of a different encoding than that of the string for which iter(some_str) was called, you could pass the encoding parameter to the iterator. All 4 encodings can be packed inside a UInt32 and as such have the same fixed-size stack-allocated variable. And IMO this still makes sense because a character is 1 code unit in ASCII, 1-4 in UTF-8, 1-2 in UTF-16, and 1 in UTF-32. The underlying methods could be parametrized for each encoding and constrained where they don't make sense for certain cases.

*: This wouldn't include the Indexing.GRAPHEME case, but that might actually be more of an argument in favor of renaming the type to Codepoint as you proposed.

> - My understanding is that the StringSlice would have some specific encoding it's stored in memory as, so I'm uncertain what "casting" that encoding during iteration would do.

It's just a matter of bitcasting the underlying bytes to the proper encoding's datatype. For example, the data for a UTF-16 string is conceptually a List[UInt16], but the String type uses a List[UInt8] (and a StringSlice points to it), so the only necessary step when doing anything is bitcasting the internal pointer, e.g. utf16_buffer = rebind[Span[UInt16, __origin_of(some_str)]](some_str.as_bytes()). Both String and StringSlice remain untouched, except inside the bodies of the methods where the casting needs to happen.
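A sketch of that reinterpretation, assuming some_str holds UTF-16 data as in the example above (no copy is made; the same bytes are simply viewed as 16-bit code units):

```mojo
var bytes = some_str.as_bytes()
# Reinterpret the underlying buffer as UTF-16 code units.
var utf16_ptr = bytes.unsafe_ptr().bitcast[UInt16]()
var utf16_len = len(bytes) // 2
```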


> Apologies again for the long and scattered nature of these comments; I'm erring on the side of leaving any feedback at all, instead of waiting to write something more polished. Thank you again Martin for helping to articulate a possible design direction here 🙂 — like you, I'm excited about the possibilities of using Mojo's 🔥 features to provide flexible and powerful string handling capabilities to our users.

Apologies again for being intense sometimes hehe. String handling is what I still love about Python, and I want to make it even better in Mojo 🔥

@jackos (Collaborator) commented Feb 13, 2025

Thanks @martinvuyk, I've added this to the next design discussion meeting.

@leb-kuchen commented Feb 14, 2025

The thing is how much memory you are willing to use to get O(n) indexing.
The most efficient solution is probably to store a subset of breakpoints.
However, segmentation is not as predictable as encoding, so the efficiency will vary.
For ASCII I think it is a good idea; Rust is experimenting with an ASCII Char, just because of how painful UTF-8 conversions are.
But I don't think a separate Grapheme type is a good idea. It would essentially be String.graphemes().next(), i.e. a wrapper around String or StringSlice.
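To make the "store a subset of breakpoints" idea concrete, here is a sketch of the data structure and the trade-off it makes (the names are mine; building the offsets table still requires running real grapheme segmentation over the string once):

```mojo
struct GraphemeIndex:
    # Byte offset of every k-th grapheme boundary. A larger k uses less
    # memory but means a longer forward scan on lookup (up to k graphemes
    # from the nearest stored entry).
    var every_k: Int
    var offsets: List[Int]

    fn nearest_boundary_before(self, grapheme_idx: Int) -> Int:
        # O(1) jump close to the target; the caller then scans forward at
        # most `every_k` graphemes starting from the returned byte offset.
        return self.offsets[grapheme_idx // self.every_k]
```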

@owenhilyard (Contributor)

Overall, I agree with @ConnorGray. I think Mojo benefits by being correct by default, and then having the ability to drop down to get more performance after you read about the hazards. In this case, that means defaulting to graphemes and giving codepoint and byte access as options. People dealing with large bodies of text will need to understand Unicode and what possibilities exist for their problem, or they can just throw CPU at the problem. When people think of Character, 90% of developers mean graphemes, so Char should be a buffer slice (or just a pointer if we're fine re-validating it on use).

> (If we're feeling particularly bold, perhaps we'd call those Grapheme and GraphemeSlice—but that might be too unfamiliar to folks 🙃.)

We can make an alias, so that people who reach for CharacterSlice get the GraphemeSlice type back and can either ignore that or look a bit further.

> I like the idea of having a dedicated ASCIIString type that would enable us to perform encoding-specific optimization 😄

The reason I think we may want to be parameterized over encoding is that you want not only ASCII but also UTF-16 (JS interop with WASM, Java) and UCS-2 (Windows). It means the stdlib-side string-handling code is going to be gross, but I would rather go for a solution we can adapt than one where we end up with "oops, we can't talk to that". This is especially true since I've seen some RAG DB research that wants to use UTF-32 as the input encoding, since it's easier to only need to do "codepoints -> graphemes" on a GPU.

> That section was proposing, in the case that we wanted it, a Char type which can have any one of the 4 encodings underneath*. If you wanted to iterate over Char of a different encoding than that of the string for which iter(some_str) was called, you could pass the encoding parameter to the iterator. All 4 encodings can be packed inside a UInt32 and as such have the same fixed-size stack-allocated variable. And IMO this still makes sense because a character is 1 code unit in ASCII, 1-4 in UTF-8, 1-2 in UTF-16, and 1 in UTF-32. The underlying methods could be parametrized for each encoding and constrained where they don't make sense for certain cases.

@martinvuyk these "casts" involve non-trivial amounts of compute. Only a few of these "casts" are cheap, like ASCII -> UTF-8. There are also issues with characters that are impossible to represent in ASCII. I think having the buffer parameterized and converting the whole thing when you want to is a better idea, since I think most code will take input, convert everything to a single encoding for internal processing, and then potentially convert back to a desired output encoding. These are also potentially fairly large copies to compute, since UTF-16 and UTF-32 usually take more memory than the equivalent UTF-8. I think this makes sense because the encoding is a property of the data in the buffer, not of the function called on the data in the buffer.

@leb-kuchen

> The thing is how much memory you are willing to use to get O(n) indexing. The most efficient solution is probably to store a subset of breakpoints. However, segmentation is not as predictable as encoding, so the efficiency will vary. For ASCII I think it is a good idea; Rust is experimenting with an ASCII Char, just because of how painful UTF-8 conversions are. But I don't think a separate Grapheme type is a good idea. It would essentially be String.graphemes().next(), i.e. a wrapper around String or StringSlice.

O(n) indexing is doable by parsing while we go. If we want better indexing, that's where indexes come into play.

@leb-kuchen

> O(n) indexing is doable by parsing while we go. If we want better indexing, that's where indexes come into play.

I think it is either O(n) memory or O(n) time. In my opinion it is not worth it, and iteration should be faster in 90% of cases.
It would also limit the extensibility of String if this API were introduced.

I am thinking of the following API:

- graphemes
- is_grapheme_boundary
- ceil_grapheme_boundary
- floor_grapheme_boundary

Graphemes are designed for segmentation, not really for indexing. It is still possible to design an API this way, but it is not where graphemes shine.
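A small usage sketch of that shape of API (method names follow the list above and are still hypothetical), where an arbitrary byte range is snapped outward to grapheme boundaries before slicing, so the resulting view never splits a user-perceived character:

```mojo
# Assuming `s` is a StringSlice and `start_byte`/`end_byte` are byte offsets:
var lo = s.floor_grapheme_boundary(start_byte)
var hi = s.ceil_grapheme_boundary(end_byte)
var visible = s[lo:hi]
```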
