[stdlib] [proposal] String, ASCII, Unicode, UTF, Graphemes #3988
Conversation
Signed-off-by: martinvuyk <[email protected]>
I like this proposal :)

#### Hold off on developing Char further and remove it from stdlib.builtin

`Char` is currently expensive to create and use compared to a `StringSlice`
Do we have numbers on this? The creation but also using different methods on it. Let's avoid optimizing things without data.
> Do we have numbers on this?
I don't need numbers to know that the bit-shifting, masking, and loops involved in decoding UTF-32 from UTF-8 are expensive compared to a pointer and a length, which is what `Span` is and what `StringSlice` uses underneath.
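To make the cost argument concrete, here is a sketch of the bit-shifting and masking that decoding one UTF-32 value (a codepoint) from UTF-8 bytes involves. This is a Python stand-in for illustration, not the actual Mojo implementation:

```python
# Decode a single codepoint (its UTF-32 value) from the start of a
# UTF-8 byte sequence. Each multi-byte case needs masks and shifts,
# which is the work the comment above contrasts with a plain
# pointer-plus-length view.
def decode_one(data: bytes) -> int:
    b0 = data[0]
    if b0 < 0x80:                        # 1-byte sequence (ASCII)
        return b0
    if b0 < 0xE0:                        # 2-byte sequence
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F)
    if b0 < 0xF0:                        # 3-byte sequence
        return (((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6)
                | (data[2] & 0x3F))
    # 4-byte sequence
    return (((b0 & 0x07) << 18) | ((data[1] & 0x3F) << 12)
            | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F))

print(hex(decode_one("é".encode("utf-8"))))  # 0xe9 (U+00E9)
```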
> also using different methods on it
Comparing a 16-byte SIMD vector is going to be more expensive* than using count-leading-zeros (most CPUs have a specialized circuit for it) and a bitwise OR (one micro-op) with a comparison against the ASCII maximum (see #3896).

*: In the context in which this function is used, the number of bytes in a sequence is a prerequisite for a lot of follow-up code, so the throughput advantage of SIMD is not realized: its latency stalls the pipeline. I have done benchmarking and found such cases in #3697 and #3528.
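The "count leading zeros" trick mentioned above amounts to counting the leading one bits of the first byte to get the UTF-8 sequence length. A Python sketch (using `int.bit_length` as a stand-in for a CPU clz instruction; the exact Mojo code lives in the linked PRs):

```python
# Sequence length from the leading byte's bit pattern:
#   0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
def utf8_seq_len(first_byte: int) -> int:
    # Invert the byte, then count how many of the top bits were ones.
    ones = 8 - (first_byte ^ 0xFF).bit_length()
    return 1 if ones == 0 else ones  # 0 leading ones means ASCII

print(utf8_seq_len(0x41))  # 'A' lead byte -> 1
print(utf8_seq_len(0xC3))  # 2-byte lead  -> 2
print(utf8_seq_len(0xE4))  # 3-byte lead  -> 3
print(utf8_seq_len(0xF0))  # 4-byte lead  -> 4
```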
> Let's avoid optimizing things without data.
A pointer and a length is always going to be less expensive than transforming data when an algorithmic throughput difference is not part of the equation. That could be the case, for example, when transforming into another mathematical domain to avoid solving differential equations directly, but it is not the case here IMO.
Hi Martin, thanks for taking the time to write up a proposal on this topic! 🙂 Apologies for the length of this response; "if I'd had more time, I would have written a shorter letter" and all that 😌

Before responding to your proposal, let me share a bit about where my head is at regarding how we handle these string processing APIs. For some context on my recent work in this area: we had a discussion internally on the stdlib team and came to a tentative consensus to move forward with the name …. My current thinking is a modification to what @owenhilyard proposed in Discord here:

More directly, I'm currently thinking … should do the following:

Then, eventually, re-add a ….

Now, some initial thoughts on your proposal 🙂

Minor clarification: this is correct re. Swift's ….

To that end, that's partly why I'm in favor of renaming our ….

Using terminology like …

I think both types of iteration (over decoded …) …. As a simple example, our Unicode case conversion logic maps a single ….

So insofar as your argument is that ….

Regarding the general thrust of the rest of the proposal:
Re. (2): One idea we've discussed a bit internally (but haven't had time to implement yet) is using named arguments within the subscript:

```mojo
var str = StringSlice("abcef")
var bytes: Span[Byte] = str[byte=2:5]
var cps: StringSlice = str[codepoints=1:3]
var chars: StringSlice = str[chars=1:5]  # Grapheme clusters
```

I still have some reservations about this, but I like that it strikes a balance between being a relatively terse syntax for indexing while staying absolutely clear and readable about which indexing semantics are being used (helping the programmer stay aware of behavior and potential performance concerns). I'm unsure, though, what I think the default should be for ….

Apologies again for the long and scattered nature of these comments; I'm erring on the side of leaving any feedback at all instead of waiting to write something more polished. Thank you again, Martin, for helping to articulate a possible design direction here 🙂 Like you, I'm excited about the possibilities of using Mojo's 🔥 features to provide flexible and powerful string handling capabilities to our users.
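The three indexing semantics above give different answers for the same text once combining characters are involved. A Python illustration of the counts (the bracket syntax above is the Mojo proposal, not an existing API):

```python
# "éf" written as "e" + U+0301 (combining acute) + "f":
# byte, codepoint, and grapheme indexing all disagree on its length.
s = "e\u0301f"
n_bytes = len(s.encode("utf-8"))  # byte indexing sees 4 units
n_codepoints = len(s)             # codepoint indexing sees 3 units
print(n_bytes, n_codepoints)      # graphemes would see 2: "é", "f"
```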
Hi @ConnorGray,

Thanks for writing; I actually like reading through well-thought-out ideas. Your "unordered thoughts" match some of my essays lol
+1 on all 3 of these
This one I'm not so fond of, because I like the idea of parametrizing more. I'll expand later.
I think we should have the method be …
I seriously think it is not necessary to deprecate the default iteration over Unicode codepoints, given that it is IMO the sane default (and practically free using …)
100% Agree on this
Also on this, but I just think that the default should be what is most often used. I also think it's quite straightforward to do:

```mojo
a = "123"
for item in a:
    c = Codepoint(item)
    # or, for what I've read them being used for very often
    # in our Unicode casing code and golang's stdlib
    # (some people call them "runes"):
    utf32_value, utf8_length = Codepoint.parse_utf32_utf8_length(item)
    ...
```
IMHO this will be much more work than parametrizing and adding …
Quite the opposite: the optimization trickles down to all users and libraries which interact with the generic …
That section was to propose, in case we wanted a ….

*: This wouldn't include when …
It's just a matter of bitcasting the underlying bytes to the proper encoding datatype. For example, the data for UTF-16 is actually stored in a …
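As a rough illustration of the "bitcast" idea in Python (not the Mojo API under discussion): a zero-copy reinterpretation of a UTF-16 string's bytes as a buffer of 16-bit code units.

```python
# View the same bytes as uint16 code units without copying,
# analogous to bitcasting a byte buffer to the encoding's datatype.
data = "héllo".encode("utf-16-le")  # UTF-16 code units, little-endian
units = memoryview(data).cast("H")  # "H" = unsigned 16-bit
print(list(units))                  # one uint16 per BMP code unit
```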
Apologies again for being intense sometimes hehe. String handling is what I still love about Python, and I want to make it even better in Mojo 🔥
Thanks @martinvuyk, I've added this to the next design discussion meeting.
The thing is how much memory you are willing to use to get better than O(n) indexing.
Overall, I agree with @ConnorGray. I think that Mojo benefits by being correct by default, and then having the ability to drop down for more performance after you read about the hazards. In this case, that means defaulting to graphemes and giving codepoint and byte access as options. People dealing with large bodies of text will need to understand Unicode and what possibilities exist for their problem, or they can just throw CPU at the problem. When people think of …
We can make an alias, so that people who reach for …
The reason I think we may want to be parameterized over encoding is that you want not only ASCII but also UTF-16 (JS interop with WASM, Java) and UCS-2 (Windows). It means the stdlib-side string handling code is going to be gross, but I would rather go for a solution we can adapt than one where we say "oops, we can't talk to that". This is especially true since I've seen some RAG DB research that wants to use UTF-32 as the input encoding, since it's easier to only need to do "codepoints -> graphemes" on a GPU.
@martinvuyk these "casts" are non-trivial amounts of compute. There are only a few of these "casts" that are cheap, like ASCII -> UTF-8. There are also issues with characters which are impossible to represent in ASCII. I think having the buffer parameterized and converting the whole thing when you want to is a better idea, since I think most code will take input, make everything a single encoding for internal processing, and then potentially convert back to a desired output encoding. These are also potentially fairly large copies to compute, since UTF-16 and UTF-32 usually take more memory than the equivalent UTF-8. I think this makes sense because the encoding is a property of the data in the buffer, not the function called on the data in the buffer.
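The size differences behind those "fairly large copies" are easy to see with Python's codecs (an illustration of the point, not the proposed Mojo API): converting between encodings is a full re-encode, and the wider encodings usually take more memory than UTF-8 for the same text.

```python
# The same text in three encodings; note the growing buffer sizes.
text = "hello, 世界"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")
print(len(utf8), len(utf16), len(utf32))  # 13 18 36
```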
I think it is either O(n) memory or O(n) time. In my opinion it is not worth it, and iteration should be faster in 90% of cases. I think of the following API: …
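The "O(n) memory or O(n) time" tradeoff can be sketched as follows (hypothetical helpers, not a proposed Mojo API): either scan the UTF-8 bytes on every index (O(n) time, no extra memory), or precompute a table of byte offsets, one per codepoint (O(1) lookup, O(n) extra memory).

```python
def codepoint_offsets(data: bytes) -> list[int]:
    """Byte offset of each codepoint: O(n) memory, built once."""
    # Continuation bytes look like 0b10xxxxxx; every other byte
    # starts a new codepoint.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

def codepoint_at(data: bytes, offsets: list[int], i: int) -> str:
    """O(1) indexing by codepoint, using the precomputed table."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[offsets[i]:end].decode("utf-8")

data = "héllo".encode("utf-8")
offs = codepoint_offsets(data)
print(codepoint_at(data, offs, 1))  # the 2-byte codepoint "é"
```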
Graphemes are designed for segmentation, not really for indexing. It is still possible to design an API this way, but it is not where graphemes shine.
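One reason graphemes resist cheap indexing: a single user-perceived character can span several codepoints. A Python illustration (full grapheme segmentation needs the Unicode UAX #29 rules; normalization below only demonstrates the codepoint-count mismatch):

```python
import unicodedata

# "égalité" written with combining accents: 9 codepoints on screen as
# 7 user-perceived characters.
s = "e\u0301galite\u0301"
print(len(s))  # 9 codepoints
composed = unicodedata.normalize("NFC", s)
print(len(composed))  # 7 codepoints after composing "e" + accent
```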
This is a proposal on the main goals, followed by a proposed approach to get there.

I would like us to reach a consensus on this topic, given that it will affect many projects and involve a lot of work to fully support.

Everyone I can think of who is involved with strings or should be part of this conversation: @JoeLoser @ConnorGray @lsh @jackos @mzaks @bgreni @thatstoasty @leb-kuchen