Description
I noticed that get_token_stream() occasionally returns token streams which are longer than expected when used with the subset argument.
As an example, I have the following subcorpus:
library(polmineR) # v0.8.9.9004
use("polmineR")

sc <- corpus("GERMAPARLMINI") |>
  subset(protocol_date == "2009-10-27") |>
  subset(speaker == "Volker Kauder")
Without subset
I retrieve the token stream for the subcorpus without subsetting it:
chars_with_stopwords <- get_token_stream(
  sc,
  p_attribute = "word",
  collapse = " "
)
nchar(chars_with_stopwords) # 185
The returned character string has 185 characters.
With subset
If I repeat the same process but include a subset argument to remove stop words and punctuation, the return value gets longer instead of shorter.
tokens_to_remove <- c(
  tm::stopwords("de"),
  polmineR::punctuation
)

chars_without_stopwords <- get_token_stream(
  sc,
  p_attribute = "word",
  collapse = " ",
  subset = {!word %in% tokens_to_remove}
)
nchar(chars_without_stopwords) # 270
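For comparison, here is a manual cross-check, under the assumption that get_token_stream() without collapse returns the plain character vector of tokens for the subcorpus; filtering and collapsing that vector in base R should yield a string shorter than 185 characters, not longer:

# Hedged cross-check, reusing sc and tokens_to_remove from above.
words <- get_token_stream(sc, p_attribute = "word")
manual <- paste(words[!words %in% tokens_to_remove], collapse = " ")
nchar(manual) # expected to be shorter than 185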
Issue
Looking at get_token_stream(), I think the problem is caused by the combination of subset, collapse and beautify (which is TRUE by default). With these arguments, the following line is the culprit:
Line 150 in 650c75f:

whitespace <- rep(collapse, times = length(.Object))
When tokens are removed via subset, the length of the unmodified input object no longer corresponds to the number of whitespace separators actually needed here. Then, in the final line
Line 154 in 650c75f:

tokens <- paste(paste(whitespace, tokens, sep = ""), collapse = "")
whitespace is longer than tokens, so the shorter tokens vector is simply recycled until the length of whitespace is reached. This is why the returned string grows instead of shrinking.
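To illustrate the recycling outside of polmineR, here is a minimal base R sketch with made-up tokens (not the actual values from the subcorpus):

# Five tokens before subsetting, two tokens afterwards; whitespace is still
# sized to the original five.
tokens <- c("Herr", "Kollege")     # tokens surviving the subset (made up)
whitespace <- rep(" ", times = 5)  # length of the unsubsetted input
paste(paste(whitespace, tokens, sep = ""), collapse = "")
# " Herr Kollege Herr Kollege Herr" -- recycled until length 5 is reached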
Potential Fix
If there is no reason to use the length of the unmodified input object here, I think that changing .Object to tokens in the first chunk I quoted should be sufficient to address this.
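A minimal sketch of what that change would look like, assuming the surrounding beautify logic stays as it is (this is not a tested patch):

# Size the separator vector by the tokens remaining after subsetting,
# not by the original input object.
whitespace <- rep(collapse, times = length(tokens))
# ... beautify adjustments unchanged ...
tokens <- paste(paste(whitespace, tokens, sep = ""), collapse = "")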