From b9b360f3b292ed5196dfebf97aa206fb25c6696d Mon Sep 17 00:00:00 2001
From: David Zimmermann-Kollenda
Date: Thu, 7 Nov 2024 20:44:36 +0100
Subject: [PATCH 1/4] first version of the article

---
 .../index.qmd | 207 ++++++++++++++++++
 1 file changed, 207 insertions(+)
 create mode 100644 blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd

diff --git a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd b/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
new file mode 100644
index 0000000..585e1e2
--- /dev/null
+++ b/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
@@ -0,0 +1,207 @@
+---
+title: "How to get a Rust-based Package to CRAN"
+description: |
+  This blog entry outlines the journey to get a Rust-based package to CRAN.
+author: David Zimmermann-Kollenda
+date: "11/07/2024"
+image: images/extendr-release-070.png
+image-alt: "The extendr logo, letter R in middle of gear."
+categories: [CRAN, Package, Best-Practices, rtiktoken]
+---
+
+I finally did it: I published a Rust-based package on CRAN.
+There were a couple of gotchas that I ran into, which I wanted to document here, so that your journey might be a bit faster and easier.
+
+Before I highlight what I learned, allow me to ~~self-promote the package~~ talk a bit about the package first.
+
+
+## The `rtiktoken` Package
+
+If you haven't been living under a rock in the last couple of years, you will have heard about the new AI revolution using large language models and more specifically GPT models such as OpenAI's ChatGPT models, which are impressively good at dealing with text.
+
+What might surprise you is that it's basically impossible to do math with text; in the end, these models are "just" doing (very large) [matrix multiplications](https://xkcd.com/1838/).
+Now you might be wondering how it is possible that these mathematical models are so good at text.
+The answer lies in encoding the text into numbers (or, to use the fancy term, "tokens").
+That is, instead of using "I like Rust and R.", the LLM would see something like `40, 1299, 56665, 326, 460, 13`, which it can use in its calculations.
+
+Why would I care about tokens?
+As you might be aware, most models have a hard limit on content size, called the context window.
+That is, a model can only deal with text up to a fixed number of tokens in size.
+For example, OpenAI's GPT-4o has a context window of 128,000 tokens ([source](https://platform.openai.com/docs/models/gpt-4o#gpt-4o)).
+That might seem plenty, but if you have large texts, you might want to know in advance whether a request will fail.
+Also, as you pay per token on most platforms, it's a good idea to know how expensive a call to an LLM is going to be.
+
+Transforming the text into the tokens is done by using a *tokenizer*, which is more or less a direct mapping of strings to integers.
+What is even better is that these mappings/tokenizers are open-sourced by OpenAI and can be used locally; there are multiple packages that allow you to do this offline.
+These packages are, for example, the original and official OpenAI Python package [`tiktoken`](https://github.com/openai/tiktoken) or implementations in other languages such as [`tiktoken-rs`](https://github.com/zurawiki/tiktoken-rs) or [`tiktoken-go`](https://github.com/pkoukk/tiktoken-go).
+Unfortunately, there ~~is~~ was no R package that does this.
+
+But you might guess where this is leading.
+Thanks to the `rextendr` package, it's really easy to create an R wrapper around Rust crates and eventually release it to CRAN.
+So this is what I did.
+Introducing the [`rtiktoken`](https://github.com/DavZim/rtiktoken) package, which is a simple wrapper around the [`tiktoken-rs`](https://github.com/zurawiki/tiktoken-rs) crate and as of 2024-11-06 lives on CRAN.
+
+Before I go into a couple of details that helped me achieve this, I wanted to quickly show you the output and functionality of the package.
+The usage of the package is as easy as the following:
+
+```r
+# install.packages("rtiktoken")
+library(rtiktoken)
+
+text <- "I like Rust and R."
+# note we have to specify which tokenizer we want to use
+# GPT-4o uses the o200k_base tokenizer, we can use either name here
+tokens <- get_tokens(text, "gpt-4o")
+tokens
+#> [1] 40 1299 56665 326 460 13
+
+decode_tokens(tokens, "gpt-4o")
+#> [1] "I like Rust and R."
+
+get_token_count(c("I like Rust and R.", "extendr rocks"), "gpt-4o")
+#> [1] 6 3
+```
+
+OK, enough of this, how does it work and what did I learn?
+
+
+## The Process of Getting a Package to CRAN
+
+To get a package to CRAN, we first need to create the package and install a couple of development dependencies: `rextendr`, `devtools`, `usethis`.
+
+
+### 1. Creating a Package
+
+Once we have a typical R package directory and file structure, we need to add the Rust structure as well.
+The easiest way is to use the packages [`devtools`](https://devtools.r-lib.org/) and [`usethis`](https://usethis.r-lib.org/):
+
+```r
+# create the basic folder structure of a package
+devtools::create("myRpkg")
+# make sure the following are executed from the new package
+setwd("myRpkg")
+# set license to MIT
+usethis::use_mit_license()
+# use RMarkdown for Readme
+usethis::use_readme_rmd()
+# use NEWS.md
+usethis::use_news_md()
+# use cran-comments.md - will be important later
+usethis::use_cran_comments()
+```
+
+And with this we should have the basic R package.
+
+A little bit of foreshadowing, but we will have to edit our `DESCRIPTION` file and add the right level of detail for our package, such as author, description, URLs, etc.
+
+
+### 2. Add Rust as a Dependency
+
+Similar to the `usethis` package, there is the `rextendr` package that makes this step pretty straightforward.
+
+```r
+rextendr::use_extendr()
+```
+
+This will create the required files in `src/` and `src/rust`.
+
+As the command tells us, whenever we update our Rust code, we should run the following to document the code and build the Rust parts.
+
+```r
+rextendr::document()
+# if we have changed our R-code and its documentation
+# we need the following as well
+devtools::document()
+```
+
+And we should be ready to go and call our default Rust function `hello_world()` (defined in `src/rust/src/lib.rs`).
+
+The actual R and Rust functions are typically the easiest parts of developing a package.
+If you need a good starter, have a look here, eg [`R/get_tokens.R`](https://github.com/DavZim/rtiktoken/blob/master/R/get_tokens.R) as well as [`src/rust/src/lib.rs`](https://github.com/DavZim/rtiktoken/blob/master/src/rust/src/lib.rs) (as we can see, I didn't lie when I said it's a *light* wrapper...).
+
+If we need to add a Rust dependency, we can use `rextendr::use_crate()` or use `cargo add xyz` directly from the `src/rust` directory.
+
+Now on to the "hard" parts.
+
+
+### 3. Get the Package to CRAN
+
+First, we need to make sure that the usual hurdles are met; see also the [R Packages (2e) Book](https://r-pkgs.org/).
+
+- document our functions using [`roxygen2`](https://roxygen2.r-lib.org/) and create the documentation using `devtools::document()`
+- fill the details of our `DESCRIPTION` file, write the `README.Rmd` and knit it to `README.md`
+- use [`testthat`](https://testthat.r-lib.org/) and write tests (not strictly needed, but will most likely save us in the future!)
+- ... other steps that are typically done in R package development
+- make sure `devtools::check()` works without a NOTE
+
+There are, however, a couple of CRAN-specific rules and best practices for packages using Rust (see also [Using Rust in CRAN Packages](https://cran.r-project.org/web/packages/using_rust.html)).
+Most of these requirements are already met, but there are a couple of must-haves and nice-to-haves.
+
+Note that some of the following `rextendr` functions are currently only available in the development version of `rextendr` (>0.3.1).
+
+
+#### CRAN Defaults
+
+First, we should tell `rextendr` that we want to use the CRAN standards.
+This sets up, for example, `Makevars` files for the different platforms.
+We achieve this by calling
+
+```r
+rextendr::use_cran_defaults()
+```
+
+
+#### MSRV
+
+Then, we should find and record our MSRV (Minimal Supported Rust Version).
+Luckily, there is the [`cargo-msrv`](https://github.com/foresterre/cargo-msrv) crate, which tells us what our MSRV is.
+To find our MSRV, we can do the following (from the terminal and not from R this time):
+
+```bash
+# install the crate (won't be a dependency of our R package!)
+cargo install cargo-msrv
+# move to the rust folder and find the MSRV
+# note this might take some time...
+cd src/rust && cargo msrv find
+```
+
+After a couple of minutes (the program installs older versions of Rust and checks if the package can be built), `cargo-msrv` reports for me that my MSRV is "1.65.0" for this test project.
+To record this, we can use the `rextendr` package from R again:
+
+```r
+rextendr::use_msrv("1.65.0")
+```
+
+
+#### Vendor Dependencies
+
+CRAN doesn't allow the download of packages from external servers; that is, we cannot download the crates from crates.io. Instead, we have to *vendor* the crates (ship the packages alongside our package).
+This sounds harder than it is: simply run the following and all our Rust dependencies will be archived to `src/rust/vendor.tar.xz`
+
+```r
+rextendr::vendor_pkgs()
+```
+
+
+#### License Updates
+
+As we are no longer the sole contributors to the package and ship dependencies as well, we need to update our licenses.
+Again, `rextendr` has us covered (though we might have to run `cargo install cargo-license` from the terminal once beforehand)
+
+```r
+rextendr::write_license_note()
+```
+
+which creates the `LICENSE.note` file listing the contributors to all our Rust dependencies.
+
+
+#### CRAN Comments
+
+Last but not least, we have the aforementioned `cran-comments.md` file, which holds the comments to the CRAN maintainers (at least when we use `usethis::release()`; if we want to release the package manually on the website, we should consider adding the comments manually as well).
+
+A couple of things resulted in multiple review rounds between me and the CRAN maintainers, which can probably be shortened.
+
+First, mention that it is a Rust-based package following CRAN's Rust guidelines and `rextendr`'s best practices.
+
+Second, we should address the size of the package, as it might raise some comments if we have added extra crate dependencies.
+The comments I got were resolved by explaining that the size comes mostly from the vendored dependencies (already compressed at the maximum compression level) and that the size of the package is otherwise minimized as much as possible.
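+To give you an idea, a `cran-comments.md` along the following lines would address both points upfront (an illustrative sketch based on my comments above, not the exact file I submitted):
+
+```
+## R CMD check results
+
+0 errors | 0 warnings | 1 note
+
+* This is a Rust-based package that follows CRAN's "Using Rust in CRAN
+  packages" policy as well as the rextendr best practices.
+* The package size NOTE stems mostly from the vendored Rust dependencies
+  (src/rust/vendor.tar.xz), which are compressed at the maximum compression
+  level; the package is otherwise minimized as much as possible.
+```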
From 29c9f8b4ec7cb7067077d4f4507f79211c913b05 Mon Sep 17 00:00:00 2001
From: DavZim
Date: Sun, 10 Nov 2024 09:29:02 +0100
Subject: [PATCH 2/4] Update blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd

Co-authored-by: Josiah Parry
---
 blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd b/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
index 585e1e2..0e9c3c2 100644
--- a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
+++ b/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
@@ -1,5 +1,5 @@
---
-title: "How to get a Rust-based Package to CRAN"
+title: "Introducing {rtiktoken}: encode text using OpenAI's Tokenizer"
description: |
  This blog entry outlines the journey to get a Rust-based package to CRAN.
author: David Zimmermann-Kollenda

From 8872fea6fc18213c33f4bc600574671ac5e90c8a Mon Sep 17 00:00:00 2001
From: DavZim
Date: Sun, 10 Nov 2024 09:30:21 +0100
Subject: [PATCH 3/4] Apply suggestions from code review

Co-authored-by: Josiah Parry
---
 .../index.qmd | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd b/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
index 0e9c3c2..d0bb126 100644
--- a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
+++ b/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
@@ -1,7 +1,7 @@
title: "Introducing {rtiktoken}: encode text using OpenAI's Tokenizer"
description: |
-  This blog entry outlines the journey to get a Rust-based package to CRAN.
+  How I published {rtiktoken} to CRAN
author: David Zimmermann-Kollenda
date: "11/07/2024"
@@ -9,10 +9,6 @@ image-alt: "The extendr logo, letter R in middle of gear."
categories: [CRAN, Package, Best-Practices, rtiktoken]
---

-I finally did it: I published a Rust-based package on CRAN.
-There were a couple of gotchas that I ran into, which I wanted to document here, so that your journey might be a bit faster and easier.
-
-Before I highlight what I learned, allow me to ~~self-promote the package~~ talk a bit about the package first.

## The `rtiktoken` Package

@@ -62,7 +58,6 @@ get_token_count(c("I like Rust and R.", "extendr rocks"), "gpt-4o")
#> [1] 6 3
```

-OK, enough of this, how does it work and what did I learn?

## The Process of Getting a Package to CRAN

@@ -153,8 +148,10 @@ rextendr::use_cran_defaults()
```

#### MSRV

-Then, we should find and record our MSRV (Minimal Supported Rust Version).
-Luckily, there is the [`cargo-msrv`](https://github.com/foresterre/cargo-msrv) crate, which tells us what our MSRV is.
+Then, we should find and record our MSRV (Minimal Supported Rust Version). The MSRV is the minimum required
+version of Rust to be able to build the R package from source. Discovering the MSRV isn't entirely straightforward.
+Luckily, there is the [`cargo-msrv`](https://github.com/foresterre/cargo-msrv) crate, which tells us what our MSRV is.
+Finding the MSRV involves compiling the Rust source code using different versions of Rust.
To find our MSRV, we can do the following (from the terminal and not from R this time):

```bash
# install the crate (won't be a dependency of our R package!)
cargo install cargo-msrv
# move to the rust folder and find the MSRV
# note this might take some time...
cd src/rust && cargo msrv find
```

After a couple of minutes (the program installs older versions of Rust and checks if the package can be built), `cargo-msrv` reports for me that my MSRV is "1.65.0" for this test project.
To record this, we can use the `rextendr` package from R again:

```r
rextendr::use_msrv("1.65.0")
```

#### Vendor Dependencies

CRAN doesn't allow the download of packages from external servers; that is, we cannot download the crates from crates.io. Instead, we have to *vendor* the crates (ship the packages alongside our package).
-This sounds harder than it is: simply run the following and all our Rust dependencies will be archived to `src/rust/vendor.tar.xz`
+This sounds harder than it is: simply run the following and all our Rust dependencies will be archived to `src/rust/vendor.tar.xz`.

```r
rextendr::vendor_pkgs()
```

From ced5876a9478743a7658515aa606332d50087250 Mon Sep 17 00:00:00 2001
From: David Zimmermann-Kollenda
Date: Sun, 10 Nov 2024 10:55:54 +0100
Subject: [PATCH 4/4] second version with feedback from Josiah

---
 .../index.qmd | 178 +++++++++++++++++-
 1 file changed, 168 insertions(+), 10 deletions(-)
 rename blog/posts/{2024-11-07-how-to-get-a-rust-pkg-to-cran => 2024-11-07-introducing-rtiktoken}/index.qmd (51%)

diff --git a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd b/blog/posts/2024-11-07-introducing-rtiktoken/index.qmd
similarity index 51%
rename from blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
rename to blog/posts/2024-11-07-introducing-rtiktoken/index.qmd
index d0bb126..2515a7f 100644
--- a/blog/posts/2024-11-07-how-to-get-a-rust-pkg-to-cran/index.qmd
+++ b/blog/posts/2024-11-07-introducing-rtiktoken/index.qmd
@@ -4,11 +4,16 @@
description: |
  How I published {rtiktoken} to CRAN
author: David Zimmermann-Kollenda
date: "11/07/2024"
-image: images/extendr-release-070.png
+image: images/extendr-release-070.png [ ] TODO
image-alt: "The extendr logo, letter R in middle of gear."
categories: [CRAN, Package, Best-Practices, rtiktoken]
---

+[ ] TODO image and alt-image
+
+I'm happy to announce that the [`rtiktoken`](https://github.com/DavZim/rtiktoken) package has found its way to CRAN.
+As this was the first time I used Rust in a real project, and as I am really happy with the ease of development with Rust and the `rextendr` package, I wanted to document my journey here and introduce the package and its inner workings in more detail.
+Lastly, I'll quickly talk about the journey of publishing the R package to CRAN.

## The `rtiktoken` Package

@@ -20,20 +25,28 @@
Now you might be wondering how it is possible that these mathematical models are so good at text.
The answer lies in encoding the text into numbers (or, to use the fancy term, "tokens").
That is, instead of using "I like Rust and R.", the LLM would see something like `40, 1299, 56665, 326, 460, 13`, which it can use in its calculations.

-Why would I care about tokens?
+
+## Why would I care about tokens?
+
As you might be aware, most models have a hard limit on content size, called the context window.
That is, a model can only deal with text up to a fixed number of tokens in size.
For example, OpenAI's GPT-4o has a context window of 128,000 tokens ([source](https://platform.openai.com/docs/models/gpt-4o#gpt-4o)).
That might seem plenty, but if you have large texts, you might want to know in advance whether a request will fail.
Also, as you pay per token on most platforms, it's a good idea to know how expensive a call to an LLM is going to be.
+Another interesting use-case around text similarity is outlined below in its own section.

-Transforming the text into the tokens is done by using a *tokenizer*, which is more or less a direct mapping of strings to integers.
+Transforming text into tokens is done by using a *tokenizer*, which is more or less a direct mapping of strings to integers.
What is even better is that these mappings/tokenizers are open-sourced by OpenAI and can be used locally; there are multiple packages that allow you to do this offline.
These packages are, for example, the original and official OpenAI Python package [`tiktoken`](https://github.com/openai/tiktoken) or implementations in other languages such as [`tiktoken-rs`](https://github.com/zurawiki/tiktoken-rs) or [`tiktoken-go`](https://github.com/pkoukk/tiktoken-go).
Unfortunately, there ~~is~~ was no R package that does this.
+Editor's note: there is or was the [`tok`](https://github.com/mlverse/tok) package, which at the time of writing is archived.
+The `tok` package acts as a wrapper around [Hugging Face Tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer), but has no offline capabilities; instead, it first needs to download the tokenizers.
+
+
+## Functionality

But you might guess where this is leading.
-Thanks to the `rextendr` package, it's really easy to create an R wrapper around Rust crates and eventually release it to CRAN.
+Thanks to the wonderful `rextendr` package, it's really easy to create an R wrapper around Rust crates and eventually release it to CRAN.
So this is what I did.
Introducing the [`rtiktoken`](https://github.com/DavZim/rtiktoken) package, which is a simple wrapper around the [`tiktoken-rs`](https://github.com/zurawiki/tiktoken-rs) crate and as of 2024-11-06 lives on CRAN.
@@ -59,6 +72,91 @@ get_token_count(c("I like Rust and R.", "extendr rocks"), "gpt-4o")
```

+## Text Similarity Use-Cases
+
+Another really interesting use case is in the field of Natural Language Processing (NLP): finding similar text.
+If you want to search through a text or compare texts, oftentimes you want to do some kind of stemming in order to have better matching.
+For example, "walk" and "walked" will not be matched by classical bag-of-words approaches without stemming, because the words are not identical.
+If we use stemming, we transform the words into their base form: both become "walk".
+Therefore, we can find the relation between the two.
+
+This technique is especially handy in LLM projects with large information retrieval tasks, where we often use Retrieval-Augmented Generation (RAG), which is a technique to find an answer to a question based on a provided knowledgebase.
+That is a fancy way of saying that we have a large database of text and want to find an answer by asking an LLM and providing relevant context for the question.
+Instead of giving it all the text, we only provide relevant chunks of the text based on some kind of similarity score between the database and the question/prompt.
+
+Let's give a small example.
+Given that we have the following text (= our knowledgebase),
+
+```
+"Alice likes to program using Rust and R"
+"Bob and his dog Edgar walked in the park"
+"Charlie likes to read books"
+```
+
+we want to find an answer to our question (= our prompt) "Who enjoys going for a walk?".
+
+Let's also assume that we have a very small language model that can only deal with a small number of words (or, more precisely, tokens) at a time, which means we cannot give it all of our knowledgebase as context.
+Note that, as an alternative, we could assume that we don't have three entries in our knowledgebase but thousands or more.
+
+Instead, we want to filter and only provide the top two closest matches from our knowledgebase.
+To find the closest matches, we can employ another technique called vector search or, even better, hybrid search.
+
+In a vector search, we embed each entry of our knowledgebase as well as our prompt using an embedding model (see for example the [OpenAI docs](https://platform.openai.com/docs/guides/embeddings)) and use a function such as cosine similarity to find the best matches between our prompt and our knowledgebase.
+
+Hybrid search enhances this technique by not only searching through the embedding space but also searching through the "human" space, using techniques such as [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or the more advanced [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) to find the best matches.
+As a side note, I couldn't find a lightweight and permissively licensed R package that implements the BM25 algorithm, but I was able to create the [`rbm25`](https://github.com/DavZim/rbm25/) package alongside the `rtiktoken` package that I introduce here.
+The package is not yet on CRAN but will follow soon.
+
+Coming back to our RAG process, we now enhance the original prompt with our selected knowledge from our knowledgebase.
+Something along the lines of
+
+```
+Hey ChatGPT,
+
+{PROMPT}
+
+Only consider the following information:
+
+{TOP_N_KNOWLEDGE_MATCHES}
+```
+
+which would transform into the following when we consider only two context matches.
+
+```
+Hey ChatGPT,
+
+Who enjoys going for a walk?
+
+Only consider the following information:
+
+"Bob and his dog Edgar walked in the park"
+"Alice likes to program using Rust and R"
+```
+
+This is of course simplified, and better prompt engineering will produce better results, but it brings across the basics.
+
+Now coming back to why tokens are interesting here.
+Remember that I said that "walked" and "walk" are not matched on a word level.
+The problem is that without stemming, TF-IDF or BM25 will not match the words from our query to the words of the right text in our knowledgebase; the correct text might therefore be excluded from the given context, leading to incorrect or incomplete answers.
+
+If we instead transform our text as well as our knowledgebase into tokens, we can see that a match is possible, as "walked" is tokenized to `26072, 295` and "walk" is tokenized to `26072`.
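+
+To make this tangible, here is a minimal sketch in R (the token IDs are the ones quoted above; the naive overlap score is only an illustration and stands in for a proper weighting scheme such as BM25):
+
+```r
+library(rtiktoken)
+
+# "walk" and "walked" share their first token (IDs as quoted above)
+get_tokens("walk", "gpt-4o")
+#> [1] 26072
+get_tokens("walked", "gpt-4o")
+#> [1] 26072   295
+
+# a deliberately naive overlap score: the share of query tokens
+# that also appear in a document
+overlap_score <- function(query, doc) {
+  length(intersect(query, doc)) / length(query)
+}
+
+overlap_score(get_tokens("walk", "gpt-4o"), get_tokens("walked", "gpt-4o"))
+#> [1] 1
+```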
+The full hybrid search then becomes the following:
+
+1. take our knowledgebase and calculate
+   1. embeddings, e.g., using Ada-002 from OpenAI
+   2. tokens, using `rtiktoken`
+2. on a new prompt, calculate embeddings and tokens as well
+3. find text in our knowledgebase (= context) with the highest weighted similarity scores, based on
+   1. vector similarity based on embedding scores
+   2. BM25 scores using words
+   3. BM25 scores using tokens
+4. enhance the prompt using the context
+5. ask an LLM for the answer using the enhanced prompt
+
+Note that this might be a bit over the top for some use cases.
+But I have made good experiences with it so far, as this retrieves relevant information quite reliably.
+

## The Process of Getting a Package to CRAN

@@ -68,11 +166,11 @@

### 1. Creating a Package

Once we have a typical R package directory and file structure, we need to add the Rust structure as well.
-The easiest way is to use the packages [`devtools`](https://devtools.r-lib.org/) and [`usethis`](https://usethis.r-lib.org/):
+The easiest way is to use the package [`usethis`](https://usethis.r-lib.org/):

```r
# create the basic folder structure of a package
-devtools::create("myRpkg")
+usethis::create_package("myRpkg")
# make sure the following are executed from the new package
setwd("myRpkg")
# set license to MIT
usethis::use_mit_license()
# use RMarkdown for Readme
usethis::use_readme_rmd()
# use NEWS.md
usethis::use_news_md()
# use cran-comments.md - will be important later
usethis::use_cran_comments()
```
@@ -104,15 +202,68 @@

As the command tells us, whenever we update our Rust code, we should run the following to document the code and build the Rust parts.

```r
rextendr::document()
-# if we have changed our R-code and its documentation
-# we need the following as well
-devtools::document()
```

And we should be ready to go and call our default Rust function `hello_world()` (defined in `src/rust/src/lib.rs`).

The actual R and Rust functions are typically the easiest parts of developing a package.
-If you need a good starter, have a look here, eg [`R/get_tokens.R`](https://github.com/DavZim/rtiktoken/blob/master/R/get_tokens.R) as well as [`src/rust/src/lib.rs`](https://github.com/DavZim/rtiktoken/blob/master/src/rust/src/lib.rs) (as we can see, I didn't lie when I said it's a *light* wrapper...).
+But to give you an example, `rtiktoken` has a function `get_tokens()` (source available at [`R/get_tokens.R`](https://github.com/DavZim/rtiktoken/blob/master/R/get_tokens.R)), which, as we saw earlier, converts the text to the respective tokens.
+The function looks like this (note that the actual function is a small wrapper around `get_tokens_internal()` for vectorized capabilities):
+
+```r
+get_tokens <- function(text, model) {
+  if (length(text) > 1) {
+    return(lapply(text, function(x) get_tokens_internal(x, model)))
+  } else {
+    get_tokens_internal(text, model)
+  }
+}
+
+get_tokens_internal <- function(text, model) {
+  res <- tryCatch(
+    rs_get_tokens(text, model),
+    error = function(e) {
+      stop(paste("Could not get tokens from text:", e))
+    }
+  )
+  res
+}
+```
+
+The main functionality is implemented in the function `rs_get_tokens()`, which is defined in [`src/rust/src/lib.rs`](https://github.com/DavZim/rtiktoken/blob/master/src/rust/src/lib.rs) and looks like this:
+
+```rust
+use extendr_api::prelude::*;
+use tiktoken_rs::{
+    get_bpe_from_model,
+    get_bpe_from_tokenizer,
+    tokenizer::{
+        get_tokenizer,
+        Tokenizer,
+    }
+};
+
+// encodes text to tokens
+#[extendr]
+fn rs_get_tokens(text: &str, model: &str) -> Vec<u32> {
+    // try to load the BPE from the model name (gpt-4o),
+    // otherwise from the tokenizer name (o200k_base)
+    let bpe = match get_bpe_from_model(model) {
+        Ok(bpe) => bpe,
+        Err(_) => {
+            get_bpe_from_tokenizer(str_to_tokenizer(model))
+                .expect("Failed to get BPE from tokenizer")
+        },
+    };
+
+    let tokens = bpe.encode_with_special_tokens(text);
+    tokens
+}
+```
+
+The Rust function `str_to_tokenizer()` is omitted from this example for brevity.
+
+I think that neatly proves the point that the package is "just" a thin wrapper around the `tiktoken-rs` crate using the `extendr_api` Rust crate.

If we need to add a Rust dependency, we can use `rextendr::use_crate()` or use `cargo add xyz` directly from the `src/rust` directory.
@@ -131,6 +282,13 @@ First, we need to make sure that the usual hurdles are met; see also the [R Packages (2e) Book](https://r-pkgs.org/).

There are, however, a couple of CRAN-specific rules and best practices for packages using Rust (see also [Using Rust in CRAN Packages](https://cran.r-project.org/web/packages/using_rust.html)).
Most of these requirements are already met, but there are a couple of must-haves and nice-to-haves.
+These are:
+
+- Rust needs to be declared as a system dependency
+- The `rustc` and `cargo` versions must be reported before building the package
+- Rust dependencies need to be vendored (included) in the R package
+- The minimum supported version of Rust (MSRV) must be available
+- A maximum of two threads may be used to build the package

Note that some of the following `rextendr` functions are currently only available in the development version of `rextendr` (>0.3.1).
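+
+To make these requirements a bit more concrete, the following sketch shows roughly how they surface in the package. This is illustrative only; `rextendr::use_cran_defaults()` and `rextendr::use_msrv()` generate and maintain the actual files, and the exact contents may differ:
+
+```bash
+# DESCRIPTION (excerpt): Rust declared as a system dependency, including the MSRV
+# SystemRequirements: Cargo (Rust's package manager), rustc >= 1.65.0
+
+# commands of the kind the generated Makevars runs:
+rustc --version && cargo --version  # report the toolchain versions
+cargo build --offline --jobs 2      # build from the vendored crates, at most 2 threads
+```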