Description
Issue
In some scenarios, elements of the types column returned by get_dbpedia_uris() are not named lists. This is (a) inconsistent and (b) causes errors when filtering by types_src, which relies on the elements of this column being named.
Example
See the following example:
library(dbpedia)
library(quanteda)

inaugural_paragraphs <- data_corpus_inaugural |>
  corpus_subset(Year == 2021) |>
  corpus_reshape(to = "paragraphs")

get_dbpedia_uris(
  x = inaugural_paragraphs["2021-Biden.145"],
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.5,
  support = 20,
  types = character(),
  api = getOption("dbpedia.endpoint"), # English endpoint
  verbose = FALSE,
  progress = FALSE
)
This will result in an error:
Error in FUN(X[[i]], ...) : subscript out of bounds
Likely underlying issue
Currently, get_dbpedia_uris() usually populates each element of the types column with either an empty list (if the entity has no types) or a list of lists containing the entity's types. The names of the nested lists refer to the source/ontology each type is derived from.
This fails, however, if the document passed to get_dbpedia_uris() contains only one entity and that entity has types from only one source. In this case, the types are added to the column as unnamed list elements. This seems to happen only if resource_min (the data.table containing the entities) has exactly one row.
Error with types_src
This is inconsistent in itself and should be addressed. In addition, the missing names cause an error in the subsequent step that extracts and filters types by their source via the types_src argument, because that step relies on the elements in types being named.
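The failure mode can be illustrated in base R. This is only a sketch, not the package's code: the cell contents and the helper `pick_source()` are assumptions made for illustration, and the unnamed cell is modelled as a plain character vector, which is one way a name-based lookup can end in "subscript out of bounds".

```r
# Illustrative cell contents only; the real structure in dbpedia may differ.
named_cell   <- list(DBpedia = list("Person", "Agent")) # types named by source
unnamed_cell <- c("Person", "Agent")                    # source name lost

# Hypothetical stand-in for filtering by types_src: pick one source by name.
pick_source <- function(cell, src) cell[[src]]

# Works: returns the nested list of DBpedia types.
dbpedia_types <- pick_source(named_cell, "DBpedia")

# Name-based lookup on the unnamed vector raises an error, captured here.
msg <- tryCatch(
  pick_source(unnamed_cell, "DBpedia"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Whatever the exact internal shape is, any lookup by source name can only succeed if that name is actually present on the element.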
Potential Solution
I think that when preparing the types for the column, it would be necessary to check whether
- there are types for only a single entity
- these types are all from the same source
If there is only one type from a single source, e.g. "Person" from "DBpedia", wrapping this value in an additional list() should work.
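A minimal base R sketch of building the types so that the source name survives even in the single-entity, single-source case. Everything here is an assumption made for illustration: `parse_types()` is a hypothetical helper, not part of dbpedia, and the input is assumed to be a comma-separated string of "Source:Type" pairs.

```r
# Hypothetical helper: turn one entity's type string into a list named by source.
# Assumption: types arrive as comma-separated "Source:Type" pairs.
parse_types <- function(type_string) {
  if (is.na(type_string) || !nzchar(type_string)) return(list())
  pairs <- strsplit(type_string, ",", fixed = TRUE)[[1]]
  src   <- sub(":.*$", "", pairs)     # text before the first colon
  type  <- sub("^[^:]*:", "", pairs)  # text after the first colon
  # split() returns a named list even when there is only one source.
  lapply(split(type, src), as.list)
}

# The problematic case: one entity, one type, one source.
single <- lapply("DBpedia:Person", parse_types)
names(single[[1]])

# Several types from two sources, and an entity without types.
mixed <- lapply(c("DBpedia:Person,DBpedia:Agent,Schema:Person", ""), parse_types)
names(mixed[[1]])
length(mixed[[2]])
```

The result of `lapply()` has one element per entity, each either an empty list or a list named by source. When such a value is assigned to a column of a one-row table, the extra list() wrapping suggested above may still be needed so that the outer level is not unwrapped on assignment.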