@mihnita (Contributor) commented Nov 24, 2025

This is not intended to be merged (unless you want to merge it).
It is only meant to share the first results.

The test scans "War and Peace" from Project Gutenberg.
The numbers in parentheses are the line count (~66K) and the word count (~1.2M).
The important numbers are the timings in nanoseconds, after the colon.

Preliminary: calling Rust from Java is about 10 times slower than calling the ICU4J implementation (roughly 4x to 20x in the runs below, depending on the platform and on which ICU4J API is used as the baseline).


WSL2 (Windows Subsystem for Linux)

Running org.unicode.icu4x.RustWordSegmenterTest
    Rust (66,037 :: 1,255,275) : 617,892,167 ns
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.722 sec - in org.unicode.icu4x.RustWordSegmenterTest
Running org.unicode.icu4x.IcuWordSegmenterTest
    ICU4J(66,037 :: 1,189,238) : 41,733,643 ns // reused BreakIterator
    ICU4J(66,037 :: 1,189,238) : 48,372,655 ns // rebuilt BreakIterator
    ICU4J(66,037 :: 1,255,275) : 116,971,994 ns // new Segmenter API
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.259 sec - in org.unicode.icu4x.IcuWordSegmenterTest

Windows

Running org.unicode.icu4x.RustWordSegmenterTest
    Rust (66,037 :: 1,255,276) : 512,522,400 ns
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.614 sec - in org.unicode.icu4x.RustWordSegmenterTest
Running org.unicode.icu4x.IcuWordSegmenterTest
    ICU4J(66,037 :: 1,189,239) : 56,442,000 ns // reused BreakIterator
    ICU4J(66,037 :: 1,189,239) : 62,615,300 ns // rebuilt BreakIterator
    ICU4J(66,037 :: 1,255,276) : 140,487,700 ns // new Segmenter API
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec - in org.unicode.icu4x.IcuWordSegmenterTest

macOS

Running org.unicode.icu4x.RustWordSegmenterTest
    Rust (66,037 :: 1,255,275) : 693,592,250 ns
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.987 sec - in org.unicode.icu4x.RustWordSegmenterTest
Running org.unicode.icu4x.IcuWordSegmenterTest
    ICU4J(66,037 :: 1,189,238) : 35,295,875 ns // reused BreakIterator
    ICU4J(66,037 :: 1,189,238) : 112,784,875 ns // rebuilt BreakIterator
    ICU4J(66,037 :: 1,255,275) : 86,888,916 ns // new Segmenter API
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.24 sec - in org.unicode.icu4x.IcuWordSegmenterTest
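
For reference, the slowdown factors implied by the timings above can be worked out with a few lines of Java. This is only arithmetic on the numbers reported in this comment; the class and method names are made up for illustration and are not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SlowdownRatios {
    /** Returns how many times slower slowNs is compared to fastNs. */
    public static double ratio(long slowNs, long fastNs) {
        return (double) slowNs / fastNs;
    }

    public static void main(String[] args) {
        // Column order per platform (timings copied from the runs above):
        // Rust, ICU4J reused BreakIterator, ICU4J rebuilt BreakIterator, ICU4J new Segmenter API
        Map<String, long[]> runs = new LinkedHashMap<>();
        runs.put("WSL2", new long[] {617_892_167L, 41_733_643L, 48_372_655L, 116_971_994L});
        runs.put("Windows", new long[] {512_522_400L, 56_442_000L, 62_615_300L, 140_487_700L});
        runs.put("macOS", new long[] {693_592_250L, 35_295_875L, 112_784_875L, 86_888_916L});

        for (Map.Entry<String, long[]> e : runs.entrySet()) {
            long[] t = e.getValue();
            System.out.printf("%-8s reused: %.1fx, rebuilt: %.1fx, Segmenter: %.1fx%n",
                    e.getKey(), ratio(t[0], t[1]), ratio(t[0], t[2]), ratio(t[0], t[3]));
        }
    }
}
```

The spread is wide: from under 4x (Windows, against the new Segmenter API) to almost 20x (macOS, against a reused BreakIterator), so "about 10 times" should be read as an order-of-magnitude figure.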

@mihnita mihnita requested a review from a team as a code owner November 24, 2025 23:23
@mihnita mihnita requested review from Manishearth and sffc November 24, 2025 23:23
@mihnita mihnita marked this pull request as draft November 24, 2025 23:25
@mihnita mihnita changed the title from "First test, very preliminary" to "Calling Rust from Java. First test, very preliminary" Nov 24, 2025
@Manishearth (Member) commented

I kind of actually do want to merge this: is there a way to mark the ICU4J dependency as optional so normal CI builds don't need to build ICU4J?

Main issue would be whether we are allowed to include the public domain text, license-wise. I know it's public domain, but @sffc remembers what our exact policies are here.

@mihnita (Contributor, Author) commented Nov 25, 2025

> I kind of actually do want to merge this: is there a way to mark the ICU4J dependency as optional so normal CI builds don't need to build ICU4J?

Nothing against it from my side.
I just didn't think it would bring much value.

Listing ICU4J as a dependency means the jar will be downloaded and used directly from Maven Central.
(https://mvnrepository.com/artifact/com.ibm.icu/icu4j/78.1, see "Files" and click on jar)
It is not built.


> Main issue would be whether we are allowed to include the public domain text, license-wise. I know it's public domain, but @sffc remembers what our exact policies are here.

From the book "... You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org."

https://www.gutenberg.org/policy/license.html

ICU lists all licenses for all the code (and not only the code) in the main LICENSE file.
I don't know how ICU4X handles this.

We can obviously replace it with something else under a more convenient license, as long as it is relatively big. But not too big, because I read the whole text into memory so that file reading does not "taint" the measurements.
We can also loop over the same content several times.
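
The setup described above (read everything into memory first, time only the segmentation, optionally loop over the same content) might look roughly like the sketch below. Assumptions, none of which come from this PR: it uses the JDK's `java.text.BreakIterator` as a stand-in so that it runs without ICU4J or the Rust bindings; the class name `PreloadedBench`, the word-counting rule (a segment counts as a word if it contains a letter or digit), the repeat count, and the fallback sample text are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.BreakIterator;
import java.util.List;
import java.util.Locale;

public class PreloadedBench {
    /** Counts segments containing at least one letter or digit (an assumed definition of "word"). */
    public static int countWords(BreakIterator it, String line) {
        it.setText(line);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (line.substring(start, end).codePoints().anyMatch(Character::isLetterOrDigit)) {
                count++;
            }
        }
        return count;
    }

    /** Segments every line `repeats` times with one reused iterator; returns the total word count. */
    public static long run(List<String> lines, int repeats) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        long words = 0;
        for (int r = 0; r < repeats; r++) {
            for (String line : lines) {
                words += countWords(it, line);
            }
        }
        return words;
    }

    public static void main(String[] args) throws Exception {
        // Load all input BEFORE starting the timer so file I/O is not measured.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Path.of(args[0]), StandardCharsets.UTF_8)
                : List.of("The quick brown fox", "jumps over the lazy dog.");
        int repeats = 10; // loop over the same content to get a larger workload

        long startNs = System.nanoTime();
        long words = run(lines, repeats);
        long elapsedNs = System.nanoTime() - startNs;

        System.out.printf("JDK (%,d :: %,d) x%d : %,d ns%n",
                lines.size(), words / repeats, repeats, elapsedNs);
    }
}
```

Looping over a smaller, more conveniently licensed text would keep the repository small while still giving the timer enough work, though repeated passes over the same data may be friendlier to caches and the JIT than a single pass over a large book.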

@Manishearth (Member) commented

Opened rust-diplomat/diplomat#1005 to discuss different string-handling mechanisms.
