
Warnings caused by overlapping annotations when processing CWB corpora #43

@ChristophLeonhardt


Issue

DBpedia Spotlight can return multiple entity annotations for the same token. In issue #42, I described the general issues with this. In one scenario, the overlapping entities share the same starting position. This is problematic for CWB corpora.

See the following example:

library(polmineR)
library(dbpedia)

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Heinrich von Brentano") |>
  subset(protocol_date == "1960-06-22") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]

get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

Running this produces warnings such as:

Warning: longer object length is not a multiple of shorter object length

Likely Cause

This seems to be due to these lines in get_dbpedia_uris():

dbpedia/R/dbpedia.R

Lines 610 to 620 in f4dc779

tab <- links[,
  list(
    cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]],
    cpos_right = expand_fun(.SD),
    dbpedia_uri = .SD[["dbpedia_uri"]],
    text = .SD[["text"]],
    types = .SD[["types"]]
  ),
  by = "start",
  .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

DBpedia Spotlight can assign different URIs to overlapping token spans that share the same starting position. Because by = "start" groups these annotations together, .SD then holds more than one row for that group, so .SD[["start"]] has length greater than one. Comparing it with dt[["start"]] via == recycles the shorter vector, and this seems to be what triggers the warning.
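To illustrate the suspected mechanism without calling the API, here is a minimal base R sketch. The objects `dt` and `sd` are made-up stand-ins for the token table and for one by-group of annotations (`.SD`); they are not taken from the package.

```r
# Toy stand-in for the token table: corpus position (id) and character offset (start).
dt <- data.frame(id = 100:102, start = c(0L, 8L, 15L))

# Toy stand-in for one group of .SD: two annotations sharing the same start.
sd <- data.frame(start = c(8L, 8L), end = c(14L, 20L))

# A length-2 vector is compared with a length-3 vector, so R recycles and warns.
# tryCatch() returns the warning message instead of printing it.
msg <- tryCatch(
  {
    sd[["start"]] == dt[["start"]]
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(msg)

# With a single annotation per start, the comparison is length 1 vs. length 3
# and no warning is raised.
sd_single <- sd[1, , drop = FALSE]
cpos_left <- dt[sd_single[["start"]] == dt[["start"]], "id"]
print(cpos_left)
```

Whether a warning appears in the real data depends on the group size and the number of tokens, which may be why it only shows up occasionally.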

Possible solution

If we do not want to encode overlapping entities (in CWB corpora at least), we have to decide which annotation to keep and which to omit. Some options are already discussed in issue #42.

To circumvent the specific issue here, the first step would be to check whether there are multiple annotations for a single token (span). For this, a check like

if (any(table(resources$start) > 1)) 

could be added before resources is reduced to resources_min in get_dbpedia_uris() for subcorpora.
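As a rough sketch of what such a check could report, assuming `resources` has at least a `start` column (the table below is toy data with placeholder URIs, not actual Spotlight output):

```r
# Toy annotation table; rows 2 and 3 share the same start offset.
resources <- data.frame(
  start = c(0L, 8L, 8L, 30L),
  end = c(5L, 14L, 20L, 36L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c", "uri_d"),  # placeholders
  stringsAsFactors = FALSE
)

# The check proposed above: is any start position annotated more than once?
has_overlap <- any(table(resources$start) > 1)

# Which start positions are affected, and which rows belong to them?
dup_starts <- unique(resources$start[duplicated(resources$start)])
affected <- resources[resources$start %in% dup_starts, ]

if (has_overlap) {
  message(
    "Multiple annotations share a start position: ",
    paste(dup_starts, collapse = ", ")
  )
}
```

Collecting the affected rows as well makes it easier to decide afterwards which annotation to keep.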

Then, it would be possible to introduce an argument which describes what to do in these cases.
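One way this could look, purely as a sketch: the function name `resolve_same_start()`, the argument name `overlap`, and its three values are hypothetical and not part of the package. Other strategies from issue #42 could be added in the same way.

```r
# Hypothetical helper: keep exactly one annotation per start position.
# overlap = "longest"  keeps the annotation covering the longest span,
# overlap = "shortest" keeps the one covering the shortest span,
# overlap = "first"    keeps the one that comes first in the input.
resolve_same_start <- function(resources,
                               overlap = c("longest", "shortest", "first")) {
  overlap <- match.arg(overlap)
  len <- resources$end - resources$start
  # order() leaves unresolved ties in their original order.
  ord <- switch(
    overlap,
    longest = order(resources$start, -len),
    shortest = order(resources$start, len),
    first = order(resources$start)
  )
  sorted <- resources[ord, , drop = FALSE]
  out <- sorted[!duplicated(sorted$start), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy data with placeholder URIs; rows 2 and 3 share start = 8.
resources <- data.frame(
  start = c(0L, 8L, 8L, 30L),
  end = c(5L, 14L, 20L, 36L),
  dbpedia_uri = c("uri_a", "uri_b", "uri_c", "uri_d"),
  stringsAsFactors = FALSE
)

kept_longest <- resolve_same_start(resources, overlap = "longest")
kept_shortest <- resolve_same_start(resources, overlap = "shortest")
```

After this step every start position occurs only once, so the grouping by "start" would see a single row per group.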

Discussion

I am not sure about argument names and defaults. In addition, since this happens very rarely (for GermaParl at least), instead of adding an argument to get_dbpedia_uris(), we could also consider an option which sets the default behavior in such cases. But I am not sure whether that is good practice.
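For comparison, a sketch of how an argument and an option could be combined, following the pattern of the existing getOption("dbpedia.lang") and getOption("dbpedia.endpoint") calls. The option name "dbpedia.overlap", the function `annotate_sketch()`, and the fallback "longest" are all assumptions for illustration only.

```r
# Hypothetical: the argument default is read from an option, with a fallback
# if the option is unset. An explicit argument always takes precedence.
annotate_sketch <- function(overlap = getOption("dbpedia.overlap", "longest")) {
  match.arg(overlap, choices = c("longest", "shortest", "first"))
}

old <- options(dbpedia.overlap = NULL)      # make sure the option is unset
default_unset <- annotate_sketch()          # falls back to "longest"

options(dbpedia.overlap = "first")
default_from_option <- annotate_sketch()    # picks up the option

explicit <- annotate_sketch(overlap = "shortest")  # argument overrides option

options(old)                                # restore previous state
```

This keeps the behavior visible in the function signature while still letting users set a session-wide default.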

As discussed in issue #42, there might be other options.
