decoding to AnnotatedPlainTextDocument fails when all tokens are removed by stopword list #291

@ChristophLeonhardt

Scenario

I want to decode a document to an AnnotatedPlainTextDocument using a list of stopwords. If all tokens are removed when doing so, the process fails.

Example

As a minimal reproducible example, consider the following subcorpus:

library(polmineR)
use("polmineR")

x <- corpus("GERMAPARLMINI") |>
  subset(speaker == "Gerda Hasselfeldt") |>
  subset(protocol_date == "2009-11-11") |>
  subset(interjection == "speech") |>
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "protocol_date", gap = 0) |>
  _[[15]]

(The subcorpus was chosen because it is very short.)

Now let's assume that we want to decode the subcorpus to an AnnotatedPlainTextDocument while removing stopwords:

tokens_to_remove <- c(
  "Bitte",
  "sehr",
  polmineR::punctuation
)

This fails because all tokens are removed:

doc <- decode(
  x,
  to = "AnnotatedPlainTextDocument",
  p_attributes = "word",
  mw = NULL,
  stoplist = tokens_to_remove,
  verbose = FALSE
)

Issue

The initial issue is that the data.table ts becomes empty if the stoplist is applied:

if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]
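
The effect of this line can be reproduced without a corpus. The following is a minimal sketch, assuming that ts is a data.table with a word column; the cpos column, the three tokens, and the single "!" standing in for polmineR::punctuation are assumptions for illustration, not the actual content of the subcorpus.

```r
library(data.table)

# Stand-in for the token stream table. Column names and tokens are
# assumptions for illustration, not the real content of the subcorpus.
ts <- data.table(cpos = 0:2, word = c("Bitte", "sehr", "!"))

# Stand-in for the stoplist; "!" replaces polmineR::punctuation here.
stoplist <- c("Bitte", "sehr", "!")

# Same filter as quoted above: every row matches the stoplist,
# so no row survives.
if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]

nrow(ts)  # 0
```

The table keeps its columns but has zero rows, which is the state any later step has to cope with.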

This results in an error later when the annotation object is created since some slots in the object are not empty.

Discussion

I assume that the obvious part of the solution is to check whether ts is empty (i.e. whether nrow(ts) == 0L) after applying the list of stopwords. However, I am not sure what should be returned here.

Normally, the return value would be an annotation object. Is returning NULL compatible with the usual workflows here, or would it be better to return an empty AnnotatedPlainTextDocument instead?
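
To make the first option concrete, here is a rough sketch of the guard with a NULL return. The helper drop_stopwords() is hypothetical and not part of polmineR; the warning text and the example table are likewise assumptions. It only illustrates the nrow(ts) == 0L check and does not settle which return value is preferable.

```r
library(data.table)

# Hypothetical helper, not part of polmineR: apply the stoplist and
# bail out early if no tokens are left.
drop_stopwords <- function(ts, stoplist = NULL) {
  if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]
  if (nrow(ts) == 0L) {
    # One possible behaviour: warn and return NULL instead of failing later.
    warning("all tokens removed by stoplist, returning NULL")
    return(NULL)
  }
  ts
}

# Example table; column names and tokens are assumptions for illustration.
tokens <- data.table(cpos = 0:2, word = c("Bitte", "sehr", "!"))

# All tokens are on the stoplist: the helper warns and returns NULL.
all_removed <- suppressWarnings(drop_stopwords(tokens, c("Bitte", "sehr", "!")))

# Only punctuation is on the stoplist: two rows remain.
some_left <- drop_stopwords(tokens, "!")
```

With a NULL return, callers that loop over many subcorpora would have to filter out NULL entries themselves; an empty document would avoid that at the cost of constructing a valid object with no annotations.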

This is somewhat related to issue #285.
