NAs in cpos_left in output for `get_dbpedia_uris()` for subcorpora #44

### Issue

As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.

Working with GermaParl, it became apparent that there are scenarios in which tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which in GermaParl is often tokenized into two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel", which starts in the middle of the second token. This is an issue because we join tokens and entities on their starting positions: the entity's start offset matches no token's start offset, so the left corpus position of the entity ends up as "NA".
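To illustrate the mismatch, here is a minimal base R sketch. The sentence, the character offsets (1-based) and the corpus positions (`id`) are made up for illustration and are not taken from GermaParl or from the package internals. It only shows why an exact match on start offsets fails while the match on end offsets succeeds.

```r
# Hypothetical sentence; offsets are 1-based character positions.
txt <- "Der G-8-Gipfel beginnt"

# Hypothetical token table: "G-8-Gipfel" is split into "G" and "-8-Gipfel".
tokens <- data.frame(
  id    = 100:103,                      # made-up corpus positions
  word  = c("Der", "G", "-8-Gipfel", "beginnt"),
  start = c(1L, 5L, 6L, 16L),
  end   = c(3L, 5L, 14L, 22L)
)

# Entity as an annotator might return it: "Gipfel" starts mid-token.
entity <- data.frame(text = "Gipfel", start = 9L, end = 14L)

# Exact matching on offsets: the right boundary aligns, the left does not.
cpos_left  <- tokens$id[match(entity$start, tokens$start)]
cpos_right <- tokens$id[match(entity$end, tokens$end)]

print(c(cpos_left = cpos_left, cpos_right = cpos_right))
```

In this toy case `cpos_right` resolves to the token "-8-Gipfel", whereas `cpos_left` is `NA` because no token starts at offset 9.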

### Potential Solutions

If we want to address this, we could use the same approach as suggested for issue #26: expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and the tokens and choose the last token that starts before the entity, using an extended version of the `expand_fun()` auxiliary function introduced earlier:

```
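# `dt` (token table with columns id, start, end) and `expand_to_token`
# (logical flag) are taken from the enclosing environment.
expand_fun <- function(.SD, direction) {
  if (direction == "right") {
    # Token whose end offset matches the end of the entity
    cpos_right <- dt[.SD[["end"]] == dt[["end"]]][["id"]]
    if (length(cpos_right) == 0L && isTRUE(expand_to_token)) {
      # No exact match: expand to the first token ending after the entity
      cpos_right <- dt[["id"]][which(dt[["end"]] > .SD[["end"]])[1]]
    }
    cpos_right
  } else {
    # Token whose start offset matches the start of the entity
    cpos_left <- dt[.SD[["start"]] == dt[["start"]]][["id"]]
    if (length(cpos_left) == 0L && isTRUE(expand_to_token)) {
      # No exact match: expand to the last token starting before the entity
      cpos_vec <- which(dt[["start"]] < .SD[["start"]])
      cpos_left <- dt[["id"]][cpos_vec[length(cpos_vec)]]
    }
    cpos_left
  }
}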
```

This would make it necessary to adjust the following chunk as well:

```
tab <- links[,
             list(
               cpos_left = expand_fun(.SD, direction = "left"),
               cpos_right = expand_fun(.SD, direction = "right"),
               dbpedia_uri = .SD[["dbpedia_uri"]],
               text = .SD[["text"]],
               types = .SD[["types"]]
             ),
             by = "start",
             .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]
```

The possibility that there are incomplete annotations for "cpos_left" should be considered here as well:

https://github.com/PolMine/dbpedia/blob/4a8fd3c6ae7d006e5a6bd4f6f0436b7f1a930ec5/R/dbpedia.R#L693
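What such a check could look like is sketched below in base R. This is not the code at the linked line. The table contents and the names `incomplete` and `tab_complete` are hypothetical. The sketch only shows a check that treats a missing left boundary the same way as a missing right boundary.

```r
# Hypothetical result table with one missing left and one missing right boundary.
tab <- data.frame(
  cpos_left  = c(100L, NA, 104L),
  cpos_right = c(100L, 102L, NA),
  text       = c("Der", "Gipfel", "Berlin")
)

# An annotation is incomplete if either boundary could not be resolved.
incomplete   <- is.na(tab$cpos_left) | is.na(tab$cpos_right)
n_incomplete <- sum(incomplete)

if (n_incomplete > 0L) {
  message(sprintf("%d annotation(s) without a complete token span.", n_incomplete))
}

# Keep only annotations with both boundaries resolved.
tab_complete <- tab[!incomplete, ]
```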

### Discussion

As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.

This might also not be very efficient, as the check is performed separately for each entity.
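One possible way around the per-entity check is to resolve all boundaries in a single vectorized step. The following base R sketch uses `findInterval()` and assumes that the token table is sorted by offset and that expansion to token boundaries is wanted. It differs from `expand_fun()` in that exact matches and expanded matches are handled by the same lookup. The data are made up for illustration.

```r
# Hypothetical token table, sorted by character offset.
tokens <- data.frame(
  id    = 100:103,
  start = c(1L, 5L, 6L, 16L),
  end   = c(3L, 5L, 14L, 22L)
)

# Hypothetical entities: one starting mid-token, one aligned with a token.
entities <- data.frame(
  text  = c("Gipfel", "Der"),
  start = c(9L, 1L),
  end   = c(14L, 3L)
)

# Left boundary: last token whose start is <= the entity start.
left_idx <- findInterval(entities$start, tokens$start)
left_idx[left_idx == 0L] <- NA  # entity starts before the first token

# Right boundary: first token whose end is >= the entity end.
right_idx <- findInterval(entities$end, tokens$end, left.open = TRUE) + 1L

entities$cpos_left  <- tokens$id[left_idx]
entities$cpos_right <- tokens$id[right_idx]  # NA if beyond the last token

print(entities)
```

Whether this is actually faster for realistic corpus sizes would have to be benchmarked. A rolling join in data.table might be another option.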
