Background
The as.instance_list() function provides a nice way to pass a partition_bundle object (from polmineR) to the workflow as shown in the vignette here.
Issue
What is missing, as far as I can see, is the possibility of reducing the vocabulary of the token streams that are passed to the mallet instance list (i.e. removing stopwords, punctuation, etc.).
In addition, it can sometimes be useful to remove very short documents before fitting the topic model. Of course, this kind of filtering could be done before passing the partition_bundle to as.instance_list(). However, if you want to remove stopwords first and then filter out short documents (which might only be short because of the stopword removal), it would be convenient to do both within the function.
Idea
Within as.instance_list() the token streams of the partitions in the partition_bundle are retrieved using the get_token_stream() method of polmineR. See the code below:
Line 75 in bd7a884:

```r
token_stream_list <- get_token_stream(x, p_attribute = p_attribute)
```
Now I thought that subsetting these token streams should be possible by utilizing the full potential of the get_token_stream() method of polmineR. As documented there (?get_token_stream), there is a subset argument which accepts expressions that allow for some quite elaborate subsetting.
As a next step, I tried to add this to the original function. Instead of line 75 quoted above, I tried to create a slightly modified version of this which includes the subset argument:
```r
token_stream_list <- get_token_stream(
  x,
  p_attribute = p_attribute,
  subset = {!get(p_attribute) %in% terms_to_drop},
  progress = TRUE
)
```
Here, I think get() is needed to find the correct column in the data.table containing the token stream. terms_to_drop would be an additional argument for as.instance_list() which, in this first draft, would simply be a character vector of terms to drop from the column indicated by the p_attribute argument. I assume that if terms_to_drop defaulted to NULL, every term would be kept, but I have not checked this yet.
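The NULL-default assumption can at least be checked in plain base R (outside of polmineR): `%in%` against NULL returns FALSE for every element, so the negated condition keeps every token.

```r
# Checking the assumed NULL default: x %in% NULL is FALSE for every element
# of x, so !x %in% NULL is all TRUE and nothing is removed.
tokens <- c("the", "court", "of", "justice")
terms_to_drop <- NULL
kept <- tokens[!tokens %in% terms_to_drop]
identical(kept, tokens)  # TRUE: with terms_to_drop = NULL, all terms are kept
```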
This kind of subset works when you run each line of the function step by step. If you use the modified function as a whole, however, you get an error that the object terms_to_drop cannot be found.
I could be mistaken here, but I assume the following: the subset expression is not evaluated in the calling environment, i.e. get_token_stream() looks for an object called terms_to_drop in the global environment, where it does not find it (unless a character vector with that name happens to exist there by chance). An easy way to make this work would be to assign terms_to_drop to the global environment before building the token_stream_list, but I do not think it is a good idea for a function to implicitly create objects there. So I am not entirely sure how to solve this robustly.
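A minimal sketch of the scoping problem and one possible fix, in plain base R without polmineR (all function names here are stand-ins, not the real API): a function that captures its subset argument unevaluated, like get_token_stream() appears to, evaluates it later in a frame where the wrapper's local terms_to_drop is not visible. Splicing the value into the expression with bquote() avoids touching the global environment.

```r
# Stand-in for get_token_stream(): captures the subset expression
# unevaluated and evaluates it against the token data later.
filter_tokens <- function(tokens, subset) {
  expr <- substitute(subset)
  keep <- eval(expr, envir = list(word = tokens))
  tokens[keep]
}

# Broken wrapper: the expression references the wrapper's local variable
# terms_to_drop, but evaluation happens inside filter_tokens(), whose
# lookup chain does not include the wrapper's frame.
wrapper_broken <- function(tokens, terms_to_drop) {
  filter_tokens(tokens, !word %in% terms_to_drop)  # error: object not found
}

# Possible fix: splice the *value* of terms_to_drop into the call with
# bquote()/.() so the expression no longer depends on the wrapper's scope.
wrapper_fixed <- function(tokens, terms_to_drop) {
  eval(bquote(filter_tokens(tokens, !word %in% .(terms_to_drop))))
}

wrapper_fixed(c("the", "court", "of", "justice"), c("the", "of"))
# → c("court", "justice")
```

Whether the same bquote() trick composes with the real get_token_stream() would need to be tested against polmineR's own handling of the subset argument.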
The code suggested above also limits the possibilities of the subset argument, since that argument could equally be used to subset the token stream by more than one p-attribute. But for now, I would assume that the removal of specific terms would be a useful addition, at least as an option.
Concerning the removal of short documents, things might be easier. Introducing some kind of min_length argument and iterating through each element of token_stream_list, evaluating its length, seems to work. Afterwards, however, all empty token streams must be removed from the list; otherwise adding them to the instance list will fail.
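The min_length step could be sketched like this in plain base R (the toy list stands in for the real token_stream_list produced by get_token_stream(); min_length is the hypothetical new argument). Dropping everything below the threshold also removes the empty streams that would otherwise break the instance list.

```r
# Toy token_stream_list, as it might look after stopword removal.
token_stream_list <- list(
  doc1 = c("court", "justice", "ruling"),
  doc2 = c("yes"),
  doc3 = character(0)  # emptied entirely by stopword removal
)

min_length <- 2L

# Keep only streams with at least min_length tokens; empty streams
# are removed as a side effect, since length 0 < min_length.
token_stream_list <- Filter(
  function(ts) length(ts) >= min_length,
  token_stream_list
)

names(token_stream_list)
# → "doc1"
```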