
NAs in cpos_left in output for get_dbpedia_uris() for subcorpora #44

@ChristophLeonhardt


Issue

As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.

Working with GermaParl, it became apparent that tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which in GermaParl is often tokenized into two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel", which starts in the middle of a token. This is a problem because tokens and entities are joined on their starting positions: the entity's start offset matches no token start, so the left corpus position of the entity becomes "NA".
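To make the mismatch concrete, here is a minimal base R sketch of the "G-8-Gipfel" case. The token ids, the 0-based character offsets, and the use of `match()` in place of the package's actual data.table join are assumptions for illustration only.

```r
# Toy token table for "G-8-Gipfel", tokenized as "G" and "-8-Gipfel".
# Offsets are assumed 0-based character positions; ids stand in for corpus positions.
tokens <- data.frame(
  id    = c(100L, 101L),
  word  = c("G", "-8-Gipfel"),
  start = c(0L, 1L)
)

# Entity as DBpedia Spotlight might return it: "Gipfel" begins at offset 4,
# i.e. inside the second token.
entity <- data.frame(text = "Gipfel", start = 4L)

# Exact matching on start positions finds no token, so the result is NA.
cpos_left <- tokens$id[match(entity$start, tokens$start)]
print(cpos_left)
```

Because offset 4 is not the start of any token, the lookup has nothing to return, which is the "NA" seen in cpos_left.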

Potential Solutions

If we want to address this, we could use the same approach as suggested for issue #26: expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and the tokens and choose the previous token using an extended version of the expand_fun() auxiliary function introduced earlier:

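expand_fun <- function(.SD, direction) {
  # `dt` (the token table) and `expand_to_token` come from the enclosing scope.
  if (direction == "right") {
    # Exact match: a token ends where the entity ends.
    cpos_right <- dt[.SD[["end"]] == dt[["end"]]][["id"]]
    if (length(cpos_right) == 0 && isTRUE(expand_to_token)) {
      # No exact match: expand to the first token ending after the entity.
      cpos_right <- dt[["id"]][which(dt[["end"]] > .SD[["end"]])[1]]
    }
    cpos_right
  } else {
    # Exact match: a token starts where the entity starts.
    cpos_left <- dt[.SD[["start"]] == dt[["start"]]][["id"]]
    if (length(cpos_left) == 0 && isTRUE(expand_to_token)) {
      # No exact match: expand to the last token starting before the entity.
      cpos_vec <- which(dt[["start"]] < .SD[["start"]])
      cpos_left <- dt[["id"]][cpos_vec[length(cpos_vec)]]
    }
    cpos_left
  }
}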

This would make it necessary to adjust the following chunk as well:

tab <- links[,
             list(
               cpos_left = expand_fun(.SD, direction = "left"),
               cpos_right = expand_fun(.SD, direction = "right"),
               dbpedia_uri = .SD[["dbpedia_uri"]],
               text = .SD[["text"]],
               types = .SD[["types"]]
             ),
             by = "start",
             .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

The possibility that there are incomplete annotations for "cpos_left" should be considered here as well:

if (isTRUE(drop_inexact_annotations) & any(is.na(tab[["cpos_right"]]))) {
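One way the condition could be extended to cover both boundaries is sketched below. The toy `tab` and the assumption that flagged rows are simply dropped (suggested by the argument name `drop_inexact_annotations`) are illustrative; what the package actually does inside this branch is not shown in this issue.

```r
# Hypothetical result table; the second entity has no left boundary.
tab <- data.frame(
  cpos_left  = c(10L, NA, 25L),
  cpos_right = c(11L, 18L, 25L)
)
drop_inexact_annotations <- TRUE

# Flag entities for which either boundary could not be resolved.
incomplete <- is.na(tab[["cpos_left"]]) | is.na(tab[["cpos_right"]])

if (isTRUE(drop_inexact_annotations) && any(incomplete)) {
  message(sprintf("Dropping %d entities with inexact boundaries.", sum(incomplete)))
  # Assumed handling: remove the affected rows.
  tab <- tab[!incomplete, ]
}
```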

Discussion

As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.

This might also not be very efficient, as the check runs separately for each entity.
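If the per-entity check turns out to be a bottleneck, the left boundary could in principle be resolved for all entities at once. The following is only a sketch in base R using `findInterval()`, not the package's implementation; the toy tables and offsets are assumptions, and the token starts must be sorted in increasing order.

```r
# Toy token table; start offsets must be sorted in increasing order.
tokens <- data.frame(
  id    = c(100L, 101L, 102L),
  start = c(0L, 1L, 11L)
)

# Two entities: one starting mid-token, one aligned with a token start.
entities <- data.frame(
  text  = c("Gipfel", "G"),
  start = c(4L, 0L)
)

# For each entity start, the index of the last token whose start is <= that offset.
idx <- findInterval(entities$start, tokens$start)

# An index of 0 means the entity starts before the first token: no match.
idx[idx == 0L] <- NA_integer_

# Expanded left boundary, plus a flag telling exact matches from expansions.
entities$cpos_left <- tokens$id[idx]
entities$exact     <- tokens$start[idx] == entities$start
```

The `exact` flag keeps the expansion optional: rows with `exact == FALSE` can be set back to NA when expansion is not wanted. A data.table rolling join would be another vectorized route.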
