diff --git a/.Rbuildignore b/.Rbuildignore
index 2ab0f9a..b2b348c 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -2,3 +2,8 @@
 ^\.appveyor\.yml$
 ^README\.md$
 LICENSE
+^.*\.Rproj$
+^\.Rproj\.user$
+^_pkgdown\.yml$
+^docs$
+^pkgdown$
diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml
new file mode 100644
index 0000000..bfc9f4d
--- /dev/null
+++ b/.github/workflows/pkgdown.yaml
@@ -0,0 +1,49 @@
+# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
+# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
+on:
+  push:
+    branches: [main, master]
+  pull_request:
+  release:
+    types: [published]
+  workflow_dispatch:
+
+name: pkgdown.yaml
+
+permissions: read-all
+
+jobs:
+  pkgdown:
+    runs-on: ubuntu-latest
+    # Only restrict concurrency for non-PR jobs
+    concurrency:
+      group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
+    env:
+      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
+    permissions:
+      contents: write
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: r-lib/actions/setup-pandoc@v2
+
+      - uses: r-lib/actions/setup-r@v2
+        with:
+          use-public-rspm: true
+
+      - uses: r-lib/actions/setup-r-dependencies@v2
+        with:
+          extra-packages: any::pkgdown, local::.
+          needs: website
+
+      - name: Build site
+        run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
+        shell: Rscript {0}
+
+      - name: Deploy to GitHub pages 🚀
+        if: github.event_name != 'pull_request'
+        uses: JamesIves/github-pages-deploy-action@v4.5.0
+        with:
+          clean: false
+          branch: gh-pages
+          folder: docs
diff --git a/DESCRIPTION b/DESCRIPTION
index 1b2e657..fa9c548 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -8,10 +8,13 @@ Authors@R: c(person("Morgan", "Jacob", role = c("aut", "cre", "cph"), email = "m
 Author: Morgan Jacob [aut, cre, cph], Sebastian Krantz [ctb]
 Maintainer: Morgan Jacob
 Description: Basic functions, implemented in C, for large data manipulation. Fast vectorised ifelse()/nested if()/switch() functions, psum()/pprod() functions equivalent to pmin()/pmax() plus others which are missing from base R. Most of these functions are callable at C level.
+URL: https://fastverse.github.io/kit/, https://github.com/fastverse/kit
 License: GPL-3
 Depends: R (>= 3.1.0)
+Suggests: knitr, rmarkdown
+VignetteBuilder: knitr
 Encoding: UTF-8
-BugReports: https://github.com/2005m/kit/issues
+BugReports: https://github.com/fastverse/kit/issues
 NeedsCompilation: yes
 ByteCompile: TRUE
 Repository: CRAN
diff --git a/NEWS.md b/NEWS.md
new file mode 100644
index 0000000..0e3142a
--- /dev/null
+++ b/NEWS.md
@@ -0,0 +1,254 @@
+# kit 0.0.20 (2025-04-17)
+
+### Notes
+
+- Update copyright date in c files
+
+- Fix note on CRAN regarding Rf_isFrame
+
+# kit 0.0.19 (2024-09-07)
+
+### Bug Fixes
+
+- Fix multiple warnings in C code.
+
+# kit 0.0.18 (2024-06-06)
+
+### Bug Fixes
+
+- Fix `iif` tests for new version of R.
+
+# kit 0.0.17 (2024-05-03)
+
+### Bug Fixes
+
+- Fix `nswitch`. Thanks to Sebastian Krantz for raising an issue.
+
+### Notes
+
+- Update copyright date in c files
+
+- Fix note on CRAN regarding SETLENGTH
+
+# kit 0.0.16 (2024-03-01)
+
+### Notes
+
+- Check if `"kit.nThread"` is defined before setting it to `1L`
+
+# kit 0.0.15 (2023-10-01)
+
+### Notes
+
+- Correct typo in configure file
+
+# kit 0.0.14 (2023-08-12)
+
+### Notes
+
+- Update configure file to extend support for GCC
+
+- Correct warnings in NEWS.Rd (strong)
+
+- Correct typo in funique.Rd thanks to @davidbudzynski
+
+# kit 0.0.13 (2023-02-24)
+
+### Notes
+
+- Function `pprod` now returns double output even if inputs are integer - in line with `base::prod` - to avoid integer overflows.
+
+- Update configure file
+
+# kit 0.0.12 (2022-10-26)
+
+### New Features
+
+- Function `pcountNA` is equivalent to `pcount(..., value = NA)`.
+
+- Function `pcountNA` and `pcount(..., value = NA)` allow `NA` counting with mixed data type (including `data.frame`). `pcountNA` also supports list-vectors as inputs and counts empty or `NULL` elements as `NA`.
+
+- Functions `panyv`, `panyNA`, `pallv` and `pallNA` are added as efficient wrappers around `pcount` and `pcountNA`. They are parallel equivalents of scalar functions `base::anyNA` and `anyv`, `allv` and `allNA` in the 'collapse' R package.
+
+- Functions `pfirst` and `plast` are added to efficiently obtain the row-wise first and last non-missing value or non-empty element of lists. They are parallel equivalents to the (column-wise) `ffirst` and `flast` functions in the 'collapse' R package. Implemented by @SebKrantz.
+
+- Functions `psum/pprod/pmean` also support logical vectors as input. Implemented by @SebKrantz.
+
+### Bug Fixes
+
+- Function `charToFact` was not returning proper results. Thanks to @alex-raw for raising an issue.
+
+### Notes
+
+- Function `pprod` now returns double output even if inputs are integer - in line with `base::prod` - to avoid integer overflows.
+
+- C compiler warnings on CRAN R-devel caused by compilation with -Wstrict-prototypes are now fixed. Declaration of functions without prototypes is depreciated in all versions of C. Thanks to Sebastian Krantz for the PR.
+
+# kit 0.0.11 (2022-03-19)
+
+### New Features
+
+- Function `pcount` now supports data.frame.
+
+### Bug Fixes
+
+- Function `pcount` now works with specific NA values, i.e. NA_real_, NA_character_ etc...
+
+# kit 0.0.10 (2021-11-28)
+
+### New Features
+
+- Function `psum`, `pmean`, `pprod`, `pany` and `pall` now support lists. Thanks to Sebastian Krantz for the request and code suggestion.
+
+### Bug Fixes
+
+- Function `topn` should now work for ALTREP object. Thanks to @ben-schwen for raising an issue.
+
+# kit 0.0.9 (2021-09-12)
+
+### Notes
+
+- Re-organise header to prevent compilation errors with new version of Clang due to conflicts between R C headers and OpenMP.
+
+# kit 0.0.8 (2021-08-21)
+
+### New Features
+
+- Function `funique` now preserves the attributes if the input is a `data.table`, `tibble` or similar objects. Thanks to Sebastian Krantz for the request.
+
+- Function `topn` now defaults to base R `order` for large value of `n`. Please see updated documentation for more information `?kit::topn`.
+
+- Function `charToFact` gains a new argument `addNA=TRUE` to be used to include (or not) `NA` in levels of the output.
+
+- Function `shareData`, `getData` and `clearData` implemented to share data objects between R sessions. These functions are experimental and might change in the future. Feedback is welcome. Please see `?kit::shareData` for more information.
+
+### Notes
+
+- Few `calloc` functions at C level have been replaced by R C API function `Calloc` to avoid valgrind errors/warnings in Travis CI.
+
+- Errors reported by `rchk` on CRAN have been fixed.
+
+# kit 0.0.7 (2021-03-07)
+
+### New Features
+
+- Function `charToFact` gains a new argument `decreasing=FALSE` to be used to order levels of the output in decreasing or increasing order.
+
+- Function `topn` gains a new argument `index=TRUE` to be used return index (`TRUE`) or values (`FALSE`) of input vector.
+
+### Bug Fixes
+
+- Some tests of memory access errors using valgrind and AddressSanitizer were reported by CRAN. An attempt to fix these errors has been submitted as part of this package version. It also seems that these same errors were causing some tests to fail for `funique` and `psort` on some platforms.
+
+### Notes
+
+- Functions `pmean`, `pprod` and `psum` will result in error if used with factors. Documentation has been updated.
+
+# kit 0.0.6 (2021-02-21)
+
+### New Features
+
+- Function `funique` and `fduplicated` gain an additional argument `fromLast=FALSE` to indicate whether the search should start from the end or beginning [PR#11](https://github.com/2005m/kit/pull/11).
+
+- Functions `pall`, `pany`, `pmean`, `pprod` and `psum` accept `data.frame` as input [PR#15](https://github.com/2005m/kit/pull/15). Please see documentation for more information.
+
+- Function `charToFact` is equivalent to to base R `as.factor` but is much quicker and only converts character vector to factor. Note that it is parallelised. For more details and benchmark please see `?kit::charToFact`.
+
+- Function `psort` is experimental and equivalent to to base R `sort` but is only for character vector. It can sort by "C locale" or by "R session locale". For more details and benchmark please see `?kit::psort`.
+
+### Notes
+
+- A few OpenMP directives were missing for functions `vswitch` and `nswitch` for character vectors. These have been added in [PR#12](https://github.com/2005m/kit/pull/12).
+
+- Function `funique` was not preserving attributes for character, logical and complex vectors/data.frames. Thanks to Sebastian Krantz (@SebKrantz) for bringing that to my attention. This has been fixed in [PR#13](https://github.com/2005m/kit/pull/13).
+
+- Functions `funique` and `uniqLen` should now be faster for `factor` and `logical` vectors [PR#14](https://github.com/2005m/kit/pull/14).
+
+# kit 0.0.5 (2020-11-21)
+
+### New Features
+
+- Function `uniqLen(x)` is equivalent to base R `length(unique(x))` and `uniqueN` in package [data.table](https://CRAN.R-project.org/package=data.table). Function `uniqLen`, implemented in C, supports vectors, `data.frame` and `matrix`. It should be faster than these functions. For more details and benchmark please see `?kit::uniqLen`.
+
+- Function `vswitch` now supports mixed encoding and gains an additional argument `checkEnc=TRUE`. Thanks to Xianying Tan (@shrektan) for the request and review [PR#7](https://github.com/2005m/kit/pull/7).
+
+- Function `nswitch` is a nested version of function `vswitch` and also supports mixed encoding. Please see please see `?kit::nswitch` for further details. Thanks to Xianying Tan (@shrektan) for the request and review [PR#10](https://github.com/2005m/kit/pull/10).
+
+### Notes
+
+- Small algorithmic improvement for functions `fduplicated`, `funique` and `countOccur` for `vectors`, `data.frame` and `matrix`.
+
+- A tests folder has been added to the source package to track coverage and bugs.
+
+### C-Level Facilities
+
+- Function `nif` has been split into two distinctive functions at C level, one has its arguments evaluated in a lazy way and is for R users and the other one (nifInternalR) is not lazy and is intended for usage at C level.
+
+# kit 0.0.4 (2020-07-21)
+
+### New Features
+
+- Function `countOccur(x)`, implemented in C, is comparable to `base` R function `table`. It returns a `data.frame` and is between 3 to 50 times faster. For more details, please see `?kit::countOccur`.
+
+- Functions `funique` and `fduplicated` now support matrices. Additionally, these two functions should also have better performance compare to previous release.
+
+- Functions `topn` has an additional argument `hasna=TRUE` to indicates whether data contains `NA` value or not. If the data does not contain `NA` values, the function should be faster.
+
+### C-Level Facilities
+
+- A few C functions have been added to subset `data.frame` and `matrix` as well as do other operations. These functions are not exported or visible to the user but might become available and callable at C level in the future.
+
+### Bug Fixes
+
+- Function `fpos` was not properly handling `NaN` and `NA` for complex and double. This should now be fixed. The function has also been changed in case the 'needle' and 'haysatck' are vectors so that a vector is returned.
+
+- Functions `funique` and `fduplicated` were not properly handling data containing `POSIX` data. This has now been fixed.
+
+# kit 0.0.3 (2020-06-21)
+
+### New Features
+
+- Functions `fduplicated(x)` and `funique(x)`, implemented in C, are comparable to `base` R functions `duplicated` and `unique`. For more details, please see `?kit::funique`.
+
+- Functions `psum` and `pprod` have now better performance for type double and complex.
+
+### Bug Fixes
+
+- Function `count(x, y)` now checks that `x` and `y` have the same class and levels. So does `pcount`.
+
+- Function `pmean` was not callable at C level because of a typo. This is now fixed.
+
+# kit 0.0.2 (2020-05-22)
+
+### New Features
+
+- Function `count(x, value)`, implemented in C, to simply count the number of times an element `value` occurs in a vector or in a list `x`. For more details, please see `?kit::count`.
+
+- Function `pmean(..., na.rm=FALSE)`, `pall(..., na.rm=FALSE)`, `pany(..., na.rm=FALSE)` and `pcount(..., value)`, implemented in C, are similar to already available function `psum` and `pprod`. These functions respectively apply base R functions `mean`, `all` and `any` element-wise. For more details, benchmarks and help, please see `?kit::pmean`.
+
+### Bug Fixes
+
+- Fix Solaris Unicode warnings for NEWS file. Benchmarks have been moved from the NEWS file to each function Rd file.
+
+- Fix some `NA` edge cases for `pprod` and `psum` so these functions behave more like base R function `prod` and `sum`.
+
+- Fix installation errors for version of R (<3.5.0).
+
+# kit 0.0.1 (2020-05-03)
+
+### Initial Release
+
+- Function `fpos(needle, haystack, all=TRUE, overlap=TRUE)`, implemented in C, is inspired by base function `which` when used in the following form `which(x == y, arr.ind =TRUE)`. Function `fpos` returns the index(es) or position(s) of a matrix/vector within a larger matrix/vector. Please see `?kit::fpos` for more details.
+
+- Function `iif(test, yes, no, na=NULL, tprom=FALSE, nThread=getOption("kit.nThread"))`, originally contributed as `fifelse` in package [data.table](https://CRAN.R-project.org/package=data.table), was moved to package kit to be developed independently. Unlike the current version of `fifelse`, `iif` allows type promotion like base function `ifelse`. For further details about the differences with `fifelse`, as well as `hutils::if_else` and `dplyr::if_else`, please see `?kit::iif`.
+
+- Function `nif(..., default=NULL)`, implemented in C, is inspired by *SQL CASE WHEN*. It is comparable to [dplyr](https://CRAN.R-project.org/package=dplyr) function `case_when` however it evaluates it arguments in a lazy way (i.e only when needed). Function `nif` was originally contributed as function `fcase` in the [data.table](https://CRAN.R-project.org/package=data.table) package but then moved to package kit so its development may resume independently. Please see `?kit::nif` for more details.
+
+- Function `pprod(..., na.rm=FALSE)` and `psum(..., na.rm=FALSE)`, implemented in C, are inspired by base function `pmin` and `pmax`. These new functions work only for integer, double and complex types and do not recycle vectors. Please see `?kit::psum` for more details.
+
+- Function `setlevels(x, old, new, skip_absent=FALSE)`, implemented in C, may be used to set levels of a factor object. Please see `?kit::setlevels` for more details.
+
+- Function `topn(vec, n=6L, decreasing=TRUE)`, implemented in C, returns the top largest or smallest `n` values for a given numeric vector `vec`. It is inspired by `dplyr::top_n` and equivalent to base functions order and sort in specific cases as shown in the documentation. Please see `?kit::topn` for more details.
+
+- Function `vswitch(x, values, outputs, default=NULL, nThread=getOption("kit.nThread"))`, implemented in C, is a vectorised version of `base` R function `switch`. This function can also be seen as a particular case of function `nif`. Please see `?kit::switch` for more details.
+
diff --git a/README.md b/README.md
index df37561..e5cb6c1 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,90 @@
 # kit
-R Package: Basic functions implemented in C (and for some missing from base R)
 
 [![CRAN](https://www.r-pkg.org/badges/version-last-release/kit?color=blue)](https://cran.r-project.org/package=kit)
 [![CRAN](https://badges.cranchecks.info/flavor/release/kit.svg)](https://cran.r-project.org/web/checks/check_results_kit.html)
-[![License: GPL v3](https://img.shields.io/github/license/2005m/kit)](https://www.gnu.org/licenses/gpl-3.0)
-[![R-CMD-check](https://github.com/2005m/kit/workflows/R-CMD-check/badge.svg)](https://github.com/2005m/kit/actions)
+[![License: GPL v3](https://img.shields.io/github/license/fastverse/kit)](https://www.gnu.org/licenses/gpl-3.0)
+[![R-CMD-check](https://github.com/fastverse/kit/workflows/R-CMD-check/badge.svg)](https://github.com/fastverse/kit/actions)
 [![Coverage Status](https://codecov.io/gh/2005m/kit/graph/badge.svg)](https://codecov.io/github/2005m/kit?branch=master)
 [![downloads](https://cranlogs.r-pkg.org/badges/kit)](https://www.r-pkg.org/pkg/kit)
 [![kit status badge](https://fastverse.r-universe.dev/badges/kit)](https://fastverse.r-universe.dev)
+
+Fast data manipulation functions implemented in C for large datasets. Provides vectorized alternatives to base R functions with significant performance improvements.
+
+## Installation
+
+```r
+# From CRAN
+install.packages("kit")
+
+# Development version
+install.packages("kit", repos = "https://fastverse.r-universe.dev")
+```
+
+## Features
+
+### Parallel Statistical Functions
+
+Vector-valued functions operating in parallel over vectors or data frames:
+
+- **`psum`, `pprod`, `pmean`**: Parallel sum, product, and mean (similar to `pmin`/`pmax`)
+- **`pall`, `pany`**: Parallel all/any operations
+- **`pcount`, `pcountNA`**: Count occurrences of values or NAs
+- **`pfirst`, `plast`**: First/last non-missing values
+
+```r
+x <- c(1, 3, NA, 5)
+y <- c(2, NA, 4, 1)
+psum(x, y, na.rm = TRUE) # [1] 3 3 4 6
+pmean(x, y, na.rm = TRUE) # [1] 1.5 3.0 4.0 3.0
+```
+
+### Vectorized and Nested Switches
+
+Fast vectorized conditional logic:
+
+- **`iif`**: Fast replacement for `ifelse()` with attribute preservation
+- **`nif`**: Nested if-else (SQL CASE WHEN equivalent)
+- **`vswitch`, `nswitch`**: Vectorized switch statements
+
+```r
+iif(x > 2, x, x - 1) # Preserves attributes unlike base::ifelse
+nif(x == 1, "one", x == 2, "two", default = "other")
+```
+
+### Sorting
+
+- **`psort`**: Parallel sort for character vectors
+- **`topn`**: Efficient partial sort (top N values) without full sorting
+
+```r
+topn(x, n = 6L, decreasing = TRUE) # Much faster than order()[1:6]
+```
+
+### Factors
+
+- **`charToFact`**: Fast character-to-factor conversion
+- **`setlevels`**: Change factor levels by reference
+
+### Unique Values and Counts
+
+- **`funique`, `fduplicated`**: Fast unique/duplicated operations
+- **`uniqLen`**: Fast equivalent to `length(unique(x))`
+- **`count`, `countNA`, `countOccur`**: Count element occurrences
+
+```r
+funique(iris$Species) # Faster than base::unique
+uniqLen(iris$Species) # Faster than length(unique())
+```
+
+### Miscellaneous
+
+- **`fpos`**: Find matrix/vector positions within larger structures
+- **`shareData`, `getData`, `clearData`**: Share data between R sessions
+
+## Documentation
+
+Full documentation available at: https://fastverse.github.io/kit/
+
+## License
+
+GPL-3
diff --git a/_pkgdown.yml b/_pkgdown.yml
new file mode 100644
index 0000000..6ac23f1
--- /dev/null
+++ b/_pkgdown.yml
@@ -0,0 +1,82 @@
+url: https://fastverse.github.io/kit/
+
+home:
+  title: Data Manipulation Functions Implemented in C
+
+template:
+  bootstrap: 5
+  bootswatch: sandstone
+  theme: ayu-dark # Or: ayu-mirage
+  math-rendering: katex
+  bslib:
+    primary: "#1e2124" # "#202224" # "#242424" # "#003254"
+    code-color: "#004573" # "#9c0027" # "#004d80" # b3002d
+    gray-dark: "#3f464d"
+
+development:
+  mode: auto
+
+navbar:
+  title: kit
+  structure:
+    left:
+    - reference
+    - articles
+    - news
+    - blog
+    right:
+    - search
+    - github
+  components:
+    reference:
+      text: Documentation
+      href: reference/index.html
+    articles:
+      text: Vignettes
+      href: articles/index.html
+    news:
+      text: News
+      href: news/index.html
+    github:
+      icon: fa-github
+      href: https://github.com/fastverse/kit
+      aria-label: GitHub
+
+
+reference:
+- title: "Parallel Statistical Functions"
+  desc: "Vector-valued (statistical) functions operating in parallel over vectors passed as arguments, or a single list of vectors/data frame."
+- contents:
+  - parallel-funs
+- title: "Vectorised and Nested Switches"
+  desc: "Fast vectorized and nested switches."
+- contents:
+  - iif
+  - nif
+  - vswitch/nswitch
+- title: "Sorting"
+  desc: "Parallel sort for strings and partial sort (N largest/smallest)."
+- contents:
+  - psort
+  - topn
+- title: "Factors"
+  desc: "Fast character to factor conversion and changing factor levels by reference."
+- contents:
+  - charToFact
+  - setlevels
+- title: "Unique Values and Counts"
+  desc: "Fast duplicated and unique and count the number of times element(s) occur."
+- contents:
+  - fduplicated/funique
+  - count
+- title: "Miscellaneous"
+  desc: "Find a matrix position inside a larger matrix and share data between R sessions."
+- contents:
+  - fpos
+  - shareData/getData/clearData
+
+articles:
+- title: "Introduction to kit"
+  desc: Introduces the package, including a walk-through of all main features.
+  contents:
+  - introduction
diff --git a/inst/NEWS.Rd b/inst/NEWS.Rd
deleted file mode 100644
index 3370f1c..0000000
--- a/inst/NEWS.Rd
+++ /dev/null
@@ -1,395 +0,0 @@
-\name{NEWS}
-\title{News for \R Package \pkg{kit}}
-\encoding{UTF-8}
-
-\newcommand{\CRANpkg}{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}
-
-\section{version 0.0.20 (2025-04-17)}{
-  \subsection{Notes}{
-    \itemize{
-      \item Update copyright date in c files
-
-      \item Fix note on CRAN regarding Rf_isFrame
-    }
-  }
-}
-
-\section{version 0.0.19 (2024-09-07)}{
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Fix multiple warnings in C code.
-    }
-  }
-}
-
-\section{version 0.0.18 (2024-06-06)}{
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Fix \code{iif} tests for new version of R.
-    }
-  }
-}
-
-\section{version 0.0.17 (2024-05-03)}{
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Fix \code{nswitch}. Thanks to Sebastian Krantz for raising an issue.
-    }
-  }
-  \subsection{Notes}{
-    \itemize{
-      \item Update copyright date in c files
-
-      \item Fix note on CRAN regarding SETLENGTH
-    }
-  }
-}
-
-\section{version 0.0.16 (2024-03-01)}{
-  \subsection{Notes}{
-    \itemize{
-      \item Check if \code{"kit.nThread"} is defined before setting it to \code{1L}
-    }
-  }
-}
-
-\section{version 0.0.15 (2023-10-01)}{
-  \subsection{Notes}{
-    \itemize{
-      \item Correct typo in configure file
-    }
-  }
-}
-
-\section{version 0.0.14 (2023-08-12)}{
-  \subsection{Notes}{
-    \itemize{
-      \item Update configure file to extend support for GCC
-
-      \item Correct warnings in NEWS.Rd (strong)
-
-      \item Correct typo in funique.Rd thanks to @davidbudzynski
-    }
-  }
-}
-
-\section{version 0.0.13 (2023-02-24)}{
-  \subsection{Notes}{
-    \itemize{
-      \item Function \code{pprod} now returns double output even if inputs are integer - in line with \code{base::prod} - to avoid integer overflows.
-
-      \item Update configure file
-    }
-  }
-}
-
-\section{version 0.0.12 (2022-10-26)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{pcountNA} is equivalent to \code{pcount(..., value = NA)}.
-
-      \item Function \code{pcountNA} and \code{pcount(..., value = NA)} allow \code{NA} counting with mixed data type (including \code{data.frame}). \code{pcountNA} also supports list-vectors as inputs and counts empty or \code{NULL} elements as \code{NA}.
-
-      \item Functions \code{panyv}, \code{panyNA}, \code{pallv} and \code{pallNA} are added as efficient wrappers around \code{pcount} and \code{pcountNA}. They are parallel equivalents of scalar functions \code{base::anyNA} and \code{anyv}, \code{allv} and \code{allNA} in the 'collapse' R package.
-
-      \item Functions \code{pfirst} and \code{plast} are added to efficiently obtain the row-wise first and last non-missing value or non-empty element of lists. They are parallel equivalents to the (column-wise) \code{ffirst} and \code{flast} functions in the 'collapse' R package. Implemented by @SebKrantz.
-
-      \item Functions \code{psum/pprod/pmean} also support logical vectors as input. Implemented by @SebKrantz.
-    }
-  }
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Function \code{charToFact} was not returning proper results. Thanks to @alex-raw for raising an issue.
-    }
-  }
-  \subsection{Notes}{
-    \itemize{
-      \item Function \code{pprod} now returns double output even if inputs are integer - in line with \code{base::prod} - to avoid integer overflows.
-
-      \item C compiler warnings on CRAN R-devel caused by compilation with -Wstrict-prototypes are now fixed. Declaration of functions without prototypes is depreciated in all versions of C. Thanks to Sebastian Krantz for the PR.
-    }
-  }
-}
-
-\section{version 0.0.11 (2022-03-19)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{pcount} now supports data.frame.
-    }
-  }
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Function \code{pcount} now works with specific NA values, i.e. NA_real_, NA_character_ etc...
-    }
-  }
-}
-
-\section{version 0.0.10 (2021-11-28)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{psum}, \code{pmean}, \code{pprod}, \code{pany} and \code{pall} now support lists. Thanks to Sebastian Krantz for the request and code suggestion.
-    }
-  }
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Function \code{topn} should now work for ALTREP object. Thanks to @ben-schwen for raising an issue.
-    }
-  }
-}
-
-\section{version 0.0.9 (2021-09-12)}{
-  \subsection{Notes}{
-    \itemize{
-      \item Re-organise header to prevent compilation errors with new version of Clang due to conflicts between R C headers and OpenMP.
-    }
-  }
-}
-
-\section{version 0.0.8 (2021-08-21)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{funique} now preserves the attributes if the input is a
-      \code{data.table}, \code{tibble} or similar objects. Thanks to Sebastian Krantz for the request.
-
-      \item Function \code{topn} now defaults to base R \code{order} for large value of \code{n}.
-      Please see updated documentation for more information \code{?kit::topn}.
-
-      \item Function \code{charToFact} gains a new argument \code{addNA=TRUE} to be used
-      to include (or not) \code{NA} in levels of the output.
-
-      \item Function \code{shareData}, \code{getData} and \code{clearData} implemented
-      to share data objects between \R sessions. These functions are experimental and might change in the future.
-      Feedback is welcome. Please see \code{?kit::shareData} for more information.
-    }
-  }
-  \subsection{Notes}{
-    \itemize{
-      \item Few \code{calloc} functions at C level have been replaced by R C API function
-      \code{Calloc} to avoid valgrind errors/warnings in Travis CI.
-
-      \item Errors reported by \code{rchk} on CRAN have been fixed.
-    }
-  }
-}
-
-\section{version 0.0.7 (2021-03-07)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{charToFact} gains a new argument \code{decreasing=FALSE} to be used
-      to order levels of the output in decreasing or increasing order.
-
-      \item Function \code{topn} gains a new argument \code{index=TRUE} to be used return
-      index (\code{TRUE}) or values (\code{FALSE}) of input vector.
-    }
-  }
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Some tests of memory access errors using valgrind and AddressSanitizer were reported by CRAN.
-      An attempt to fix these errors has been submitted as part of this package version. It also seems that
-      these same errors were causing some tests to fail for \code{funique} and \code{psort} on some platforms.
-    }
-  }
-  \subsection{Notes}{
-    \itemize{
-      \item Functions \code{pmean}, \code{pprod} and \code{psum} will result
-      in error if used with factors. Documentation has been updated.
-    }
-  }
-}
-
-\section{version 0.0.6 (2021-02-21)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{funique} and \code{fduplicated} gain an additional argument
-      \code{fromLast=FALSE} to indicate whether the search should start from the end or beginning
-      \href{https://github.com/2005m/kit/pull/11}{PR#11}.
-
-      \item Functions \code{pall}, \code{pany}, \code{pmean},
-      \code{pprod} and \code{psum} accept \code{data.frame} as input
-      \href{https://github.com/2005m/kit/pull/15}{PR#15}. Please see documentation for more
-      information.
-
-      \item Function \code{charToFact} is equivalent to to base R \code{as.factor} but is much
-      quicker and only converts character vector to factor. Note that it is parallelised. For more
-      details and benchmark please see \code{?kit::charToFact}.
-
-      \item Function \code{psort} is experimental and equivalent to to base R \code{sort}
-      but is only for character vector. It can sort by "C locale" or by "R session locale".
-      For more details and benchmark please see \code{?kit::psort}.
-    }
-  }
-  \subsection{Notes}{
-    \itemize{
-      \item A few OpenMP directives were missing for functions \code{vswitch} and
-      \code{nswitch} for character vectors. These have been added in
-      \href{https://github.com/2005m/kit/pull/12}{PR#12}.
-
-      \item Function \code{funique} was not preserving attributes for character, logical and
-      complex vectors/data.frames. Thanks to Sebastian Krantz (@SebKrantz) for bringing that to my
-      attention. This has been fixed in \href{https://github.com/2005m/kit/pull/13}{PR#13}.
-
-      \item Functions \code{funique} and \code{uniqLen} should now be faster for
-      \code{factor} and \code{logical} vectors \href{https://github.com/2005m/kit/pull/14}{PR#14}.
-    }
-  }
-}
-
-\section{version 0.0.5 (2020-11-21)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{uniqLen(x)} is equivalent to base R \code{length(unique(x))} and
-      \code{uniqueN} in package \CRANpkg{data.table}. Function \code{uniqLen}, implemented in C, supports
-      vectors, \code{data.frame} and \code{matrix}. It should be faster than these functions. For more
-      details and benchmark please see \code{?kit::uniqLen}.
-
-      \item Function \code{vswitch} now supports mixed encoding and gains an additional argument
-      \code{checkEnc=TRUE}. Thanks to Xianying Tan (@shrektan) for the request and review
-      \href{https://github.com/2005m/kit/pull/7}{PR#7}.
-
-      \item Function \code{nswitch} is a nested version of function \code{vswitch}
-      and also supports mixed encoding. Please see please see \code{?kit::nswitch} for further details.
-      Thanks to Xianying Tan (@shrektan) for the request and review \href{https://github.com/2005m/kit/pull/10}{PR#10}.
-    }
-  }
-  \subsection{Notes}{
-    \itemize{
-      \item Small algorithmic improvement for functions \code{fduplicated}, \code{funique}
-      and \code{countOccur} for \code{vectors}, \code{data.frame} and \code{matrix}.
-
-      \item A tests folder has been added to the source package to track coverage and bugs.
-    }
-  }
-  \subsection{C-Level Facilities}{
-    \itemize{
-      \item Function \code{nif} has been split into two distinctive functions at C level,
-      one has its arguments evaluated in a lazy way and is for R users and the other one (nifInternalR)
-      is not lazy and is intended for usage at C level.
-    }
-  }
-}
-
-\section{version 0.0.4 (2020-07-21)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{countOccur(x)}, implemented in C, is comparable to \code{base}
-      \R function \code{table}. It returns a \code{data.frame} and is between 3 to 50 times faster.
-      For more details, please see \code{?kit::countOccur}.
-
-      \item Functions \code{funique} and \code{fduplicated} now support matrices.
-      Additionally, these two functions should also have better performance compare to previous release.
-
-      \item Functions \code{topn} has an additional argument \code{hasna=TRUE} to indicates whether
-      data contains \code{NA} value or not. If the data does not contain \code{NA} values, the function
-      should be faster.
-    }
-  }
-  \subsection{C-Level Facilities}{
-    \itemize{
-      \item A few C functions have been added to subset \code{data.frame} and \code{matrix} as well as
-      do other operations. These functions are not exported or visible to the user but might become
-      available and callable at C level in the future.
-    }
-  }
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Function \code{fpos} was not properly handling \code{NaN} and \code{NA} for complex
-      and double. This should now be fixed. The function has also been changed in case the 'needle' and
-      'haysatck' are vectors so that a vector is returned.
-
-      \item Functions \code{funique} and \code{fduplicated} were not properly handling
-      data containing \code{POSIX} data. This has now been fixed.
-    }
-  }
-}
-
-\section{version 0.0.3 (2020-06-21)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Functions \code{fduplicated(x)} and \code{funique(x)}, implemented in C,
-      are comparable to \code{base} \R functions \code{duplicated} and \code{unique}. For more details,
-      please see \code{?kit::funique}.
-
-      \item Functions \code{psum} and \code{pprod} have now better performance for
-      type double and complex.
-    }
-  }
-  \subsection{Bug Fixes}{
-    \itemize{
-      \item Function \code{count(x, y)} now checks that \code{x} and \code{y} have the same class and
-      levels. So does \code{pcount}.
-
-      \item Function \code{pmean} was not callable at C level because of a typo. This is now fixed.
-    }
-  }
-}
-
-\section{version 0.0.2 (2020-05-22)}{
-  \subsection{New Features}{
-    \itemize{
-      \item Function \code{count(x, value)}, implemented in C, to simply count the number of times
-      an element \code{value} occurs in a vector or in a list \code{x}. For more details, please see
-      \code{?kit::count}.
- - \item Function \code{pmean(..., na.rm=FALSE)}, \code{pall(..., na.rm=FALSE)}, - \code{pany(..., na.rm=FALSE)} and \code{pcount(..., value)}, implemented in C, - are similar to already available function \code{psum} and \code{pprod}. These - functions respectively apply base \R functions \code{mean}, \code{all} and \code{any} element-wise. - For more details, benchmarks and help, please see \code{?kit::pmean}. - } - } - \subsection{Bug Fixes}{ - \itemize{ - \item Fix Solaris Unicode warnings for NEWS file. Benchmarks have been moved from the NEWS file to - each function Rd file. - - \item Fix some \code{NA} edge cases for \code{pprod} and \code{psum} so these - functions behave more like base \R function \code{prod} and \code{sum}. - - \item Fix installation errors for version of R (<3.5.0). - } - } -} - -\section{version 0.0.1 (2020-05-03)}{ - \subsection{Initial Release}{ - \itemize{ - \item Function \code{fpos(needle, haystack, all=TRUE, overlap=TRUE)}, implemented in C, is - inspired by base function \code{which} when used in the following form - \code{which(x == y, arr.ind =TRUE}). Function \code{fpos} returns the index(es) or position(s) - of a matrix/vector within a larger matrix/vector. Please see \code{?kit::fpos} for more - details. - - \item Function \code{iif(test, yes, no, na=NULL, tprom=FALSE, nThread=getOption("kit.nThread"))}, - originally contributed as \code{fifelse} in package \CRANpkg{data.table}, was moved to package kit - to be developed independently. Unlike the current version of \code{fifelse}, \code{iif} allows - type promotion like base function \code{ifelse}. For further details about the differences - with \code{fifelse}, as well as \code{hutils::if_else} and \code{dplyr::if_else}, please see - \code{?kit::iif}. - - \item Function \code{nif(..., default=NULL)}, implemented in C, is inspired by - \emph{SQL CASE WHEN}. 
It is comparable to \CRANpkg{dplyr} function \code{case_when} however it - evaluates it arguments in a lazy way (i.e only when needed). Function \code{nif} was - originally contributed as function \code{fcase} in the \CRANpkg{data.table} package but then moved - to package kit so its development may resume independently. Please see \code{?kit::nif} for - more details. - - \item Function \code{pprod(..., na.rm=FALSE)} and \code{psum(..., na.rm=FALSE)}, - implemented in C, are inspired by base function \code{pmin} and \code{pmax}. These new - functions work only for integer, double and complex types and do not recycle vectors. Please - see \code{?kit::psum} for more details. - - \item Function \code{setlevels(x, old, new, skip_absent=FALSE)}, implemented in C, - may be used to set levels of a factor object. Please see \code{?kit::setlevels} for more details. - - \item Function \code{topn(vec, n=6L, decreasing=TRUE)}, implemented in C, returns the top - largest or smallest \code{n} values for a given numeric vector \code{vec}. It is inspired by - \code{dplyr::top_n} and equivalent to base functions order and sort in specific cases as shown - in the documentation. Please see \code{?kit::topn} for more details. - - \item Function \code{vswitch(x, values, outputs, default=NULL, nThread=getOption("kit.nThread"))} - , implemented in C, is a vectorised version of \code{base} \R function \code{switch}. This - function can also be seen as a particular case of function \code{nif}. Please see - \code{?kit::switch} for more details. - } - } -} diff --git a/man/count.Rd b/man/count.Rd index 5748f26..9475cb8 100644 --- a/man/count.Rd +++ b/man/count.Rd @@ -7,9 +7,9 @@ Simple functions to count the number of times an element occurs. } \usage{ - count(x, value) - countNA(x) - countOccur(x) +count(x, value) +countNA(x) +countOccur(x) } \arguments{ \item{x}{ A vector or list for \code{countNA}. 
A vector for \code{count} and a vector or \code{data.frame} for \code{countOccur}.} diff --git a/man/fpos.Rd b/man/fpos.Rd index 455b77b..a89bb26 100644 --- a/man/fpos.Rd +++ b/man/fpos.Rd @@ -2,10 +2,10 @@ \alias{fpos} \title{ Find a matrix position inside a larger matrix } \description{ -The function \code{fpos} returns the locations (row and column index) where a small matrix may be found in a larger matrix. The function also works with vectors. +The function \code{fpos} returns the locations (row and column index) where a small matrix may be found in a larger matrix. The function also works with vectors. } \usage{ - fpos(needle, haystack, all=TRUE, overlap=TRUE) +fpos(needle, haystack, all=TRUE, overlap=TRUE) } \arguments{ \item{needle}{ A matrix or vector to search for in the larger matrix or vector \code{haystack}. Note that the \code{needle} dimensions (row and column size) must be smaller than the \code{haystack} dimensions. } @@ -24,10 +24,10 @@ small_matrix = matrix(c(14, 15, 24, 25), nrow = 2) fpos(small_matrix, big_matrix) -# Example 2: find a vector inside a larger one +# Example 2: find a vector inside a larger one fpos(14:15, 1:30) -# Example 3: +# Example 3: big_matrix = matrix(c(1:5), nrow = 10, ncol = 5) small_matrix = matrix(c(2:3), nrow = 2, ncol = 2) diff --git a/man/funique.Rd b/man/funique.Rd index 5933d36..6843ea0 100644 --- a/man/funique.Rd +++ b/man/funique.Rd @@ -7,9 +7,9 @@ Similar to base R functions \code{duplicated} and \code{unique}, \code{fduplicated} and \code{funique} are slightly faster for vectors and much faster for \code{data.frame}. Function \code{uniqLen} is equivalent to base R \code{length(unique)} or \code{data.table::uniqueN}. 
} \usage{ - fduplicated(x, fromLast = FALSE) - funique(x, fromLast = FALSE) - uniqLen(x) +fduplicated(x, fromLast = FALSE) +funique(x, fromLast = FALSE) +uniqLen(x) } \arguments{ \item{x}{ A vector, data.frame or matrix.} diff --git a/man/iif.Rd b/man/iif.Rd index 3426b45..b8991c2 100644 --- a/man/iif.Rd +++ b/man/iif.Rd @@ -5,7 +5,7 @@ \code{iif} is a faster and more robust replacement of \code{\link[base]{ifelse}}. It is comparable to \code{dplyr::if_else}, \code{hutils::if_else} and \code{data.table::fifelse}. It returns a value with the same length as \code{test} filled with corresponding values from \code{yes}, \code{no} or eventually \code{na}, depending on \code{test}. It does not support S4 classes. } \usage{ - iif(test, yes, no, na=NULL, tprom=FALSE, nThread=getOption("kit.nThread")) +iif(test, yes, no, na=NULL, tprom=FALSE, nThread=getOption("kit.nThread")) } \arguments{ \item{test}{ A logical vector. } @@ -17,7 +17,7 @@ \details{ In contrast to \code{\link[base]{ifelse}} attributes are copied from \code{yes} to the output. This is useful when returning \code{Date}, \code{factor} or other classes. Like \code{dplyr::if_else} and \code{hutils::if_else}, the \code{na} argument is by default set to \code{NULL}. This argument is set to \code{NA} in data.table::fifelse. -Similarly to \code{dplyr::if_else} and when \code{tprom=FALSE}, \code{iif} requires same type for arguments \code{yes} and \code{no}. This is not strictly the case for \code{data.table::fifelse} which will coerce integer to double. +Similarly to \code{dplyr::if_else} and when \code{tprom=FALSE}, \code{iif} requires same type for arguments \code{yes} and \code{no}. This is not strictly the case for \code{data.table::fifelse} which will coerce integer to double. When \code{tprom=TRUE}, \code{iif} behavior is similar to \code{base::ifelse} in the sense that it will promote or coerce \code{yes} and \code{no}to the "highest" used type. 
Note, however, that unlike \code{base::ifelse} attributes are still conserved. } \value{ diff --git a/man/nif.Rd b/man/nif.Rd index ea44941..8026d9b 100644 --- a/man/nif.Rd +++ b/man/nif.Rd @@ -5,7 +5,7 @@ \code{nif} is a fast implementation of SQL \code{CASE WHEN} statement for R. Conceptually, \code{nif} is a nested version of \code{\link{iif}} (with smarter implementation than manual nesting). It is not the same but it is comparable to \code{dplyr::case_when} and \code{data.table::fcase}. } \usage{ - nif(..., default=NULL) +nif(..., default=NULL) } \arguments{ \item{...}{ A sequence consisting of logical condition (\code{when})-resulting value (\code{value}) \emph{pairs} in the following order \code{when1, value1, when2, value2, ..., whenN, valueN}. Logical conditions \code{when1, when2, ..., whenN} must all have the same length, type and attributes. Each \code{value} may either share length with \code{when} or be length 1. Please see Examples section for further details.} diff --git a/man/psort.Rd b/man/psort.Rd index 1c11b4f..0861ecd 100644 --- a/man/psort.Rd +++ b/man/psort.Rd @@ -6,8 +6,8 @@ It is currently experimental and might change in the future. Use with caution. } \usage{ - psort(x, decreasing=FALSE, na.last=NA, - nThread=getOption("kit.nThread"),c.locale=TRUE) +psort(x, decreasing=FALSE, na.last=NA, + nThread=getOption("kit.nThread"),c.locale=TRUE) } \arguments{ \item{x}{ A vector of type character. 
If other, it will default to \code{base::sort}} @@ -31,12 +31,12 @@ identical(psort(x, c.locale=TRUE), sort(x, method="radix")) # strings = as.character(as.hexmode(1:1000)) # x = sample(strings, 1e8, replace=TRUE) # system.time({kit::psort(x, na.last = TRUE, nThread = 1L)}) -# user system elapsed +# user system elapsed # 2.833 0.434 3.277 # system.time({sort(x,method="radix",na.last = TRUE)}) -# user system elapsed +# user system elapsed # 5.597 0.559 6.176 # system.time({x[order(x,method="radix",na.last = TRUE)]}) -# user system elapsed +# user system elapsed # 5.561 0.563 6.143 } diff --git a/man/psum.Rd b/man/psum.Rd index ce3b1cf..825416e 100644 --- a/man/psum.Rd +++ b/man/psum.Rd @@ -14,22 +14,22 @@ \alias{plast} \title{Parallel (Statistical) Functions} \description{ -Vector-valued (statistical) functions operating in parallel over vectors passed as arguments, or a single list of vectors (such as a data frame). Similar to \code{\link{pmin}} and \code{\link{pmax}}, except that these functions do not recycle vectors. +Vector-valued (statistical) functions operating in parallel over vectors passed as arguments, or a single list of vectors (such as a data frame). Similar to \code{\link{pmin}} and \code{\link{pmax}}, except that these functions do not recycle vectors. } \usage{ - psum(..., na.rm = FALSE) - pprod(..., na.rm = FALSE) - pmean(..., na.rm = FALSE) - pfirst(...) # (na.rm = TRUE) - plast(...) # (na.rm = TRUE) - pall(..., na.rm = FALSE) - pallNA(...) - pallv(..., value) - pany(..., na.rm = FALSE) - panyNA(...) - panyv(..., value) - pcount(..., value) - pcountNA(...) +psum(..., na.rm = FALSE) +pprod(..., na.rm = FALSE) +pmean(..., na.rm = FALSE) +pfirst(...) # (na.rm = TRUE) +plast(...) # (na.rm = TRUE) +pall(..., na.rm = FALSE) +pallNA(...) +pallv(..., value) +pany(..., na.rm = FALSE) +panyNA(...) +panyv(..., value) +pcount(..., value) +pcountNA(...) 
} \arguments{ \item{...}{ suitable (atomic) vectors of the same length, or a single list of vectors (such as a \code{data.frame}). See Details on the allowed data types for each function, and Examples.} @@ -43,14 +43,14 @@ Functions \code{psum}, \code{pprod} work for integer, logical, double and comple \code{pany} and \code{pall} are derived from base functions \code{all} and \code{any} and only allow logical inputs. -\code{pcount} counts the occurrence of \code{value}, and expects arguments of the same data type (except for \code{value = NA}). \code{pcountNA} is equivalent to \code{pcount} with \code{value = NA}, and they both allow \code{NA} counting in mixed-type data. \code{pcountNA} additionally supports list vectors and counts empty or \code{NULL} elements as \code{NA}. +\code{pcount} counts the occurrence of \code{value}, and expects arguments of the same data type (except for \code{value = NA}). \code{pcountNA} is equivalent to \code{pcount} with \code{value = NA}, and they both allow \code{NA} counting in mixed-type data. \code{pcountNA} additionally supports list vectors and counts empty or \code{NULL} elements as \code{NA}. -Functions \code{panyv/pallv} are wrappers around \code{pcount}, and \code{panyNA/pallNA} are wrappers around \code{pcountNA}. They return a logical vector instead of the integer count. +Functions \code{panyv/pallv} are wrappers around \code{pcount}, and \code{panyNA/pallNA} are wrappers around \code{pcountNA}. They return a logical vector instead of the integer count. None of these functions recycle vectors i.e. all input vectors need to have the same length. All functions support long vectors with up to \code{2^64-1} elements. } \value{ -\code{psum/pprod/pmean} return the sum, product or mean of all arguments. The value returned will be of the highest argument type (integer < double < complex). \code{pprod} only returns double or complex. \code{pall[v/NA]} and \code{pany[v/NA]} return a logical vector. 
\code{pcount[NA]} returns an integer vector. \code{pfirst/plast} return a vector of the same type as the inputs. +\code{psum/pprod/pmean} return the sum, product or mean of all arguments. The value returned will be of the highest argument type (integer < double < complex). \code{pprod} only returns double or complex. \code{pall[v/NA]} and \code{pany[v/NA]} return a logical vector. \code{pcount[NA]} returns an integer vector. \code{pfirst/plast} return a vector of the same type as the inputs. } \seealso{ Package 'collapse' provides column-wise and scalar-valued analogues to many of these functions. @@ -61,7 +61,7 @@ x = c(1, 3, NA, 5) y = c(2, NA, 4, 1) z = c(3, 4, 4, 1) -# Example 1: psum +# Example 1: psum psum(x, y, z, na.rm = FALSE) psum(x, y, z, na.rm = TRUE) @@ -105,7 +105,7 @@ pmean(iris[,1:2]) # x = rnorm(n) # 763 Mb # y = rnorm(n) # z = rnorm(n) -# +# # microbenchmark::microbenchmark( # kit=psum(x, y, z, na.rm = TRUE), # base=rowSums(do.call(cbind,list(x, y, z)), na.rm=TRUE), @@ -119,7 +119,7 @@ pmean(iris[,1:2]) # x = sample(c(TRUE, FALSE, NA), n, TRUE) # 382 Mb # y = sample(c(TRUE, FALSE, NA), n, TRUE) # z = sample(c(TRUE, FALSE, NA), n, TRUE) -# +# # microbenchmark::microbenchmark( # kit=pany(x, y, z, na.rm = TRUE), # base=sapply(1:n, function(i) any(x[i],y[i],z[i],na.rm=TRUE)), diff --git a/man/shareData.Rd b/man/shareData.Rd index bb9fce2..c2fbe8b 100644 --- a/man/shareData.Rd +++ b/man/shareData.Rd @@ -7,9 +7,9 @@ Experimental functions that enable the user to share a R object between 2 \R sessions. 
} \usage{ - shareData(data, map_name, verbose=FALSE) - getData(map_name, verbose=FALSE) - clearData(x, verbose=FALSE) +shareData(data, map_name, verbose=FALSE) +getData(map_name, verbose=FALSE) +clearData(x, verbose=FALSE) } \arguments{ \item{data}{ A \R object like a vector or a \code{data.frame}.} diff --git a/man/topn.Rd b/man/topn.Rd index 075c774..7dcb791 100644 --- a/man/topn.Rd +++ b/man/topn.Rd @@ -2,12 +2,12 @@ \alias{topn} \title{ Top N values index} \description{ - \code{topn} is used to get the indices of the few values of an input. This is an extension of \code{\link{which.max}}/\code{\link{which.min}} which provide \emph{only} the first such index. - + \code{topn} is used to get the indices of the few values of an input. This is an extension of \code{\link{which.max}}/\code{\link{which.min}} which provide \emph{only} the first such index. + The output is the same as \code{order(vec)[1:n]}, but internally optimized not to sort the irrelevant elements of the input (and therefore much faster, for small \code{n} relative to input size). } \usage{ - topn(vec, n=6L, decreasing=TRUE, hasna=TRUE, index=TRUE) +topn(vec, n=6L, decreasing=TRUE, hasna=TRUE, index=TRUE) } \arguments{ \item{vec}{ A numeric vector of type numeric or integer. Other types are not supported yet. } @@ -23,7 +23,7 @@ \examples{ x = rnorm(1e4) -# Example 1: index of top 6 negative values +# Example 1: index of top 6 negative values topn(x, 6L, decreasing=FALSE) order(x)[1:6] diff --git a/man/vswitch.Rd b/man/vswitch.Rd index 409d646..0ee33ff 100644 --- a/man/vswitch.Rd +++ b/man/vswitch.Rd @@ -6,12 +6,12 @@ \code{vswitch}/ \code{nswitch} is a vectorised version of \code{base} function \code{switch}. This function can also be seen as a particular case of function \code{nif}, as shown in examples below, and should also be faster. 
} \usage{ - vswitch(x, values, outputs, default=NULL, - nThread=getOption("kit.nThread"), - checkEnc=TRUE) - nswitch(x, ..., default=NULL, - nThread=getOption("kit.nThread"), - checkEnc=TRUE) +vswitch(x, values, outputs, default=NULL, + nThread=getOption("kit.nThread"), + checkEnc=TRUE) +nswitch(x, ..., default=NULL, + nThread=getOption("kit.nThread"), + checkEnc=TRUE) } \arguments{ \item{x}{A vector or list.} diff --git a/pkgdown/extra.css b/pkgdown/extra.css new file mode 100644 index 0000000..522ed42 --- /dev/null +++ b/pkgdown/extra.css @@ -0,0 +1,68 @@ +.navbar-nav .nav-item > .nav-link { + margin-right: 10px; +} +.template-home img.logo { + width: 150px; +} +img.logo { + width: 150px; + margin-left: 30px; +} +.h1, .h2, .h3, h1, h2, h3 { + margin-top: 35px; + margin-bottom: 10px; +} +body { + font-size: 100%; +} +dd { + padding-left: 1.5rem !important; + margin-bottom: 0.5rem !important; +} +/* +p { + font-size: 0.875em; 14px/16=0.875em +} +*/ +.fa-bluesky { + font-family: "Font Awesome 6 Brands"; + font-weight: 400; +} +span.fa.fa-bluesky { + font-size: 15.5px; +} +@media screen and (min-width: 1000px) { + span.fa.fa-bluesky { + padding-left: 12px; + } +} +span.fa.fa-twitter { + font-size: 18px; +} +span.fa.fa-github { + font-size: 18px; + margin-right: 100px; +} +a { + color: #0089b3; /* #007da3 */ +} +a:hover { + color: #005873; /* #027ca1; */ +} +pre { + color: #cccccc; +} +small.nav-text.text-muted { + color: #999a9c !important; /* #8e8c84 #999a9c; -> Same as navbar */ +} + +.form-control, +.form-control::placeholder { + color: #999a9c !important; +} + +[data-bs-theme="dark"] { + --bs-body-color: #cccccc !important; + --bs-secondary-color: #cccccc !important; + --bs-tertiary-color: #999a9c !important; +} diff --git a/vignettes/introduction.Rmd b/vignettes/introduction.Rmd new file mode 100644 index 0000000..32778a6 --- /dev/null +++ b/vignettes/introduction.Rmd @@ -0,0 +1,262 @@ +--- +title: "Introduction to kit" +output: rmarkdown::html_vignette 
+vignette: > + %\VignetteIndexEntry{Introduction to kit} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +```{r setup} +library(kit) +``` + +## Overview + +**kit** provides a collection of fast utility functions implemented in C for data manipulation in R. It serves as a lightweight, high-performance toolkit for tasks that are either slow or cumbersome in base R, such as row-wise operations, vectorized conditionals, and duplicate detection. + +Key features include: + +* **Parallel statistical functions**: Row-wise operations (`psum`, `pmean`, `pfirst`) using OpenMP. +* **Vectorized conditionals**: Fast `if-else` logic (`iif`, `nif`, `vswitch`) that preserves attributes. +* **Efficient set operations**: Faster `unique`, `duplicated`, and `count` for vectors and data frames. +* **Partial sorting**: Retrieve top N elements without sorting the entire vector (`topn`). +* **Factor utilities**: Fast character-to-factor conversion (`charToFact`) and level manipulation (`setlevels`). + +Most functions are implemented in C and support multi-threading where applicable, making them significantly faster than their base R equivalents on large datasets. + +## Parallel Statistical Functions + +Computing row-wise statistics across multiple vectors or data frame columns is a common task. While base R has `pmin()` and `pmax()`, it lacks efficient equivalents for sum, mean, or product. **kit** fills this gap. + +### Row-wise Arithmetic + +`psum()`, `pmean()`, and `pprod()` compute parallel sum, mean, and product respectively. They accept multiple vectors or a single list/data frame. 
+ +```{r} +x <- c(1, 3, NA, 5) +y <- c(2, NA, 4, 1) +z <- c(3, 4, 4, 1) + +# Parallel sum +psum(x, y, z, na.rm = TRUE) + +# Parallel mean +pmean(x, y, z, na.rm = TRUE) +``` + +They are particularly useful for data frames: + +```{r} +df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)) +psum(df) +``` + +### Coalescing Values + +`pfirst()` and `plast()` return the first or last non-missing value across a set of vectors. This is equivalent to the SQL `COALESCE` function (for `pfirst`). + +```{r} +primary <- c(NA, 2, NA, 4) +secondary <- c(1, NA, 3, NA) +fallback <- c(0, 0, 0, 0) + +# Take first available value +pfirst(primary, secondary, fallback) +``` + +### Logical and Count Operations + +You can check for conditions or count values row-wise with `pall`, `pany`, and `pcount`. + +```{r} +a <- c(TRUE, FALSE, NA, TRUE) +b <- c(TRUE, NA, TRUE, FALSE) +c <- c(NA, TRUE, FALSE, TRUE) + +# Any TRUE per row? +pany(a, b, c, na.rm = TRUE) + +# Count NAs per row +pcountNA(a, b, c) + +# Count specific value (e.g., TRUE) per row +pcount(a, b, c, value = TRUE) +``` + +## Vectorized Conditionals + +### Fast If-Else (`iif`) + +Base R's `ifelse()` is known to be slow and often strips attributes (like `Date` class or factor levels). `iif()` is a faster, more robust alternative that preserves attributes from the `yes` argument. + +```{r} +dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")) + +# Base ifelse strips class +class(ifelse(dates > "2024-01-01", dates, dates - 1)) + +# iif preserves class +class(iif(dates > "2024-01-01", dates, dates - 1)) +``` + +It also supports explicit `NA` handling: + +```{r} +x <- c(-2, -1, NA, 1, 2) +iif(x > 0, "positive", "non-positive", na = "missing") +``` + +### Nested Conditionals (`nif`) + +For multiple conditions, `nif()` offers a cleaner, more efficient syntax than nested `ifelse()` calls, similar to SQL's `CASE WHEN`. 
+ +```{r} +score <- c(95, 82, 67, 45, 78) + +nif( + score >= 90, "A", + score >= 80, "B", + score >= 70, "C", + score >= 60, "D", + default = "F" +) +``` + +### Vectorized Switch (`vswitch`, `nswitch`) + +`vswitch()` maps input values to outputs efficiently. + +```{r} +status_code <- c(1L, 2L, 3L, 1L, 4L) + +vswitch( + x = status_code, + values = c(1L, 2L, 3L), + outputs = c("pending", "approved", "rejected"), + default = "unknown" +) +``` + +For pairwise syntax, `nswitch()` pairs values and outputs directly. + +```{r} +nswitch(status_code, + 1L, "pending", + 2L, "approved", + 3L, "rejected", + default = "unknown" +) +``` + +It can also replace with values from other vectors (columns), mixing scalars and vectors: + +```{r} +df <- data.frame( + code = c(1, 2, 1, 3, 2), + val_a = c(10, 20, 30, 40, 50), + val_b = c(100, 200, 300, 400, 500) +) +with(df, nswitch(code, + 1, val_a, + 2, val_b, + 3, 0, + default = NA_real_ +)) +``` + +## Fast Unique and Duplicates + +**kit** provides optimized versions of `unique()` and `duplicated()` that are significantly faster for vectors and data frames. + +### Unique Values and Duplicates + +```{r} +vec <- c("a", "b", "a", "c", "b") + +# Get unique values +funique(vec) + +# Check for duplicates +fduplicated(vec) +``` + +`uniqLen()` efficiently counts the number of unique elements without allocating the unique vector itself: + +```{r} +df <- data.frame( + x = c(1, 1, 2, 2), + y = c("a", "a", "b", "b") +) +uniqLen(df) +funique(df) +``` + +### Counting Occurrences + +`countOccur()` produces a frequency table (similar to `table()` or `dplyr::count()`) but returns a standard data frame. + +```{r} +countOccur(c("apple", "banana", "apple", "cherry")) +``` + +## Sorting and Utilities + +### Partial Sorting (`topn`) + +Sorting a large vector just to get the top few elements is inefficient. `topn()` uses a partial sorting algorithm to retrieve the top (or bottom) $N$ indices or values. 
+ +```{r} +set.seed(42) +x <- rnorm(1000) + +# Get indices of top 5 values +topn(x, n = 5) + +# Get the actual values (decreasing = FALSE for bottom values) +topn(x, n = 5, decreasing = FALSE, index = FALSE) +``` + +### Factor Manipulation + +`charToFact()` is a fast alternative to `as.factor()` for character vectors, with control over `NA` levels. + +```{r} +charToFact(c("a", "b", NA, "a")) +``` + +`setlevels()` allows you to change factor levels by reference (in-place), avoiding object copying. + +### Finding Positions (`fpos`) + +`fpos()` finds the positions of a pattern (needle) within a vector (haystack). It can be used to find occurrences of one vector inside another. + +```{r} +haystack <- c(1, 2, 3, 4, 1, 2, 5) +needle <- c(1, 2) + +fpos(needle, haystack) +``` + +## Summary + +| Task | kit function | Base R equivalent | +|:---|:---|:---| +| **Row-wise sum** | `psum()` | `rowSums(cbind(...))` | +| **Row-wise mean** | `pmean()` | `rowMeans(cbind(...))` | +| **First non-NA** | `pfirst()` | `apply(..., 1, function(x) x[!is.na(x)][1])` | +| **Fast if-else** | `iif()` | `ifelse()` | +| **Nested if-else** | `nif()` | Nested `ifelse()` | +| **Switch** | `vswitch()` | `match()` + indexing | +| **Unique values** | `funique()` | `unique()` | +| **Top N indices** | `topn()` | `order()[1:n]` | +| **Char to Factor** | `charToFact()` | `as.factor()` | + +For comprehensive details and performance benchmarks, please refer to the individual function documentation.