Description
Issue
DBpedia Spotlight can return multiple entity annotations for the same token. In issue #42, I described the general issues with this. In one scenario, the overlapping entities share the same starting position. This is problematic for CWB corpora.
See the following example:
library(polmineR)
library(dbpedia)
sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Heinrich von Brentano") |>
  subset(protocol_date == "1960-06-22") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]
get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)
There are warnings stating that
Warning: longer object length is not a multiple of shorter object length
Likely Cause
This seems to be due to these lines in get_dbpedia_uris():
Lines 610 to 620 in f4dc779
tab <- links[,
  list(
    cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]],
    cpos_right = expand_fun(.SD),
    dbpedia_uri = .SD[["dbpedia_uri"]],
    text = .SD[["text"]],
    types = .SD[["types"]]
  ),
  by = "start",
  .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]
DBpedia Spotlight adds different URIs to overlapping spans of tokens which share the same starting position. Since the starting position is the same for both annotations, the group defined by by = "start" contains more than one row, so .SD[["start"]] has a length greater than one when it is compared with dt[["start"]]. R then recycles the shorter vector against the longer one, which presumably causes the warning.
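The warning message itself can be reproduced with base R alone, without the corpus or the API. The following is a minimal sketch with made-up values; `dt_start` and `sd_start` are hypothetical stand-ins for dt[["start"]] and .SD[["start"]] and are not taken from the package.

```r
# Hypothetical character offsets of three tokens (stand-in for dt[["start"]]).
dt_start <- c(0L, 5L, 9L)

# Two annotations that share the same start (stand-in for .SD[["start"]]
# within one group formed by by = "start").
sd_start <- c(5L, 5L)

# Comparing a length-2 vector with a length-3 vector triggers recycling.
# tryCatch() captures the warning text instead of printing it.
msg <- tryCatch(
  sd_start == dt_start,
  warning = function(w) conditionMessage(w)
)
print(msg)

# With a single annotation per start, the comparison is unproblematic.
single <- 5L == dt_start
print(which(single))
```

If the assumption about the group size holds, this is the same comparison that runs inside the by = "start" group when two annotations share a starting position.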
Possible solution
If we do not want to encode overlapping entities (in CWB corpora at least), we have to decide which annotation to keep and which to omit. Some options are already discussed in issue #42.
To circumvent the specific issue here, the first step would be to check whether there are multiple annotations for a single token (span). For this, a check like
if (any(table(resources$start) > 1))
could be added before resources is reduced to resources_min in get_dbpedia_uris() for subcorpora.
Then, it would be possible to introduce an argument which describes what to do in these cases.
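As a rough sketch of how the check and such an argument could fit together, assuming resources has at least the columns start and end: the helper name `resolve_overlaps()`, the argument name `keep` and its values are hypothetical and not part of the dbpedia package, and the rows below are made-up placeholders rather than real API output.

```r
# Hypothetical helper: keep one annotation per starting position.
# `keep` is an assumed argument name; its values are illustrative only.
resolve_overlaps <- function(resources, keep = c("longest", "shortest", "first")) {
  keep <- match.arg(keep)

  # The check proposed above: nothing to do if every start is unique.
  if (!any(table(resources$start) > 1)) return(resources)

  span <- resources$end - resources$start
  idx <- seq_len(nrow(resources))

  # Sort so that the annotation to keep comes first within each start.
  ord <- switch(
    keep,
    longest = order(resources$start, -span, idx),
    shortest = order(resources$start, span, idx),
    first = order(resources$start, idx)
  )
  res <- resources[ord, , drop = FALSE]

  # Drop every row except the first one for each start.
  res <- res[!duplicated(res$start), , drop = FALSE]
  rownames(res) <- NULL
  res
}

# Made-up example: two annotations share start = 10.
resources <- data.frame(
  start = c(10L, 10L, 42L),
  end = c(16L, 28L, 47L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c"),
  stringsAsFactors = FALSE
)

out <- resolve_overlaps(resources, keep = "longest")
print(out)
```

Keeping the longest span is only one of the strategies discussed in issue #42; the sketch does not handle overlaps that start at different positions.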
Discussion
I am not sure about argument names and defaults. In addition, since this happens very rarely (for GermaParl at least), an alternative to an additional argument for get_dbpedia_uris() would be an option which sets the default behavior in such cases. But I am not sure if that is good practice.
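Both variants can be combined by letting the argument default to an option, similar to how language and api are passed via getOption() in the example above. This is only a sketch: the option name `dbpedia.overlap`, the function `overlap_strategy()` and the allowed values are assumptions for illustration.

```r
# Hypothetical: the argument takes its default from an option, so users can
# either set the behavior once per session or override it per call.
overlap_strategy <- function(overlap = getOption("dbpedia.overlap", default = "longest")) {
  match.arg(overlap, choices = c("longest", "shortest", "first"))
}

old <- options(dbpedia.overlap = NULL)   # make sure the option is unset
default_value <- overlap_strategy()      # falls back to "longest"

options(dbpedia.overlap = "first")
from_option <- overlap_strategy()        # taken from the option

per_call <- overlap_strategy("shortest") # explicit argument wins

options(old)                             # restore previous state
```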
As discussed in issue #42, there might be other options.