-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
111 lines (79 loc) · 4.92 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# BertopicR
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/jpcompartir/BertopicR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/jpcompartir/BertopicR/actions/workflows/R-CMD-check.yaml)
[![pkgdown](https://github.com/jpcompartir/BertopicR/actions/workflows/pkgdown.yaml/badge.svg)](https://github.com/jpcompartir/BertopicR/actions/workflows/pkgdown.yaml)
<!-- badges: end -->
The goal of BertopicR is to allow R users to access bertopic's topic modelling suite in R via Reticulate. Bertopic was written and developed by Maarten Grootendorst (and other contributors!).
The package does not aim to implement every feature of bertopic, and is designed with specific end users in mind who may not be experienced programmers or developers. You may submit issues for feature requests; however, it may be faster to go direct to the original, Python library which has excellent documentation. [[BERTopic documentation](https://maartengr.github.io/BERTopic/index.html)]
We have tried to stay true to the Python library, whilst developing a package which has an 'R feel' i.e. uses functions more than OOP. In places we have changed the names of arguments, for example in the Python library BERTopic() takes `hdbscan_model = x`, but we have opted for `clustering_model = x`. The reason for this is that `hdbscan_model =` is an artifact from the early days of bertopic and the author wants to ensure code is backwards compatible, but if the package were being developed now it's likely the author would opt for `clustering_model =`. There are other such changes to be aware of.
The package currently installs an exact version of bertopic - 0.15.0, features introduced after this version will take time to, or may never, reach this package.
## Installation
Before installing bertopic make sure that you have miniconda installed, if you don't:
```{r, eval=FALSE}
library(reticulate) #install.packages("reticulate") if you don't have this already or aren't sure how to install.
reticulate::install_miniconda()
```
Once you have reticulate and miniconda installed, you can then install the development version of BertopicR from [GitHub](https://github.com/) with:
``` {r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("jpcompartir/BertopicR")
library(BertopicR)
#Check your environment has been loaded correctly and bertopic has been installed:
BertopicR::check_python_dependencies()
```
If you receive the message:
"bertopic not in installed packages of current environment..."
run:
```{r, eval = FALSE}
BertopicR::install_python_dependencies()
```
## Quickstart
```{r, echo = FALSE}
library(BertopicR)
```
BertopicR ships with a dataset of unstructured text data `bert_example_data`
```{r}
data <- BertopicR::bert_example_data
embedder <- bt_make_embedder("all-minilm-l6-v2")
embeddings <- bt_do_embedding(embedder, documents = data$message, batch_size = 16L)
reducer <- bt_make_reducer_umap(n_neighbours = 10L, n_components = 10L, metric = "cosine")
clusterer <- bt_make_clusterer_hdbscan(min_cluster_size = 20L, metric = "euclidean", cluster_selection_method = "eom", min_samples = 10L)
topic_model <- bt_compile_model(embedding_model = embedder,
reduction_model = reducer,
clustering_model = clusterer)
#Fit the model
fitted_model <- bt_fit_model(model = topic_model,
documents = data$message,
embeddings = embeddings)
fitted_model$get_topic_info() %>%
dplyr::tibble()
```
## Error Messages and causes
We have tried to provide informative error messages but sometimes you may be faced with error
messages that have arisen from the parent python function used. If you get an error that begins
with "Error in py_call_impl(callable, call_args\$unnamed, call_args\$named) : ", you can use
reticulate::py_last_error() to trace the origin of the error message.
Note that one common cause of error messages when working with python functions in R arises when the
user fails to specify integer numbers as integer type. In R, integers are of type "numeric" by default,
python functions typically require them to be explicitly specified as integer type. For any formal function
arguments, this conversion is dealt with within in the function, however you should be aware of this if specifying extra arguments to any function. In R, this is achieved by placing an L after the number or using the as.integer() function:
```{r}
# numbers default to "numeric"
class(4)
# we can force them to be "integer"
class(4L)
class(as.integer(4))
```