Inconsistency of xml types in subcorpus objects created by subset and split might cause issues #293

### Summary

Depending on how a subcorpus object is created, seemingly identical objects can carry different "xml types" (i.e. "nested" or "flat"). This can lead to incorrect behavior, in particular to `s_attributes()` returning the wrong values for structural attributes.

### Examples

For this example, I use "GERMAPARL2". I assume that the nested nature of the corpus might be relevant here. However, the issue of inconsistent xml assignments can also be observed for GermaParlMini, for example.

#### Data

Depending on the broader analysis workflow, there are different ways to create a subcorpus. You could either split a corpus into a number of subcorpora by creating a subcorpus bundle via `split()` or you could directly create a subcorpus using `subset()`.

The assumption would be that if you retrieve the same subcorpus from the subcorpus bundle, it should be identical to the subcorpus created with `subset()`.

So, we can create the following two objects:

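```
library(polmineR)

# Variant 1: subset the corpus directly to the protocol of 1949-09-07
date_from_subset <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07")

# Variant 2: split the corpus by date and take the first subcorpus from the
# bundle (using the `_` placeholder with `[[` requires R >= 4.3)
date_from_bundle <- corpus("GERMAPARL2") |>
  split(s_attribute = "protocol_date") |>
  _[[1]]
```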

Both objects are identical in size and contain the same speakers, etc.
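A minimal sketch of how this can be checked, assuming the two objects from the block above exist in the session. `setequal()` is used rather than `identical()` on the assumption that the order of the returned speaker names is not guaranteed to match between the two objects.

```r
library(polmineR)

# Same number of tokens in both subcorpora?
size(date_from_subset) == size(date_from_bundle)

# Same set of speakers? Only a single s-attribute is requested here.
setequal(
  s_attributes(date_from_subset, "speaker_name"),
  s_attributes(date_from_bundle, "speaker_name")
)
```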

#### Issue: Wrongly returned structural attributes

Looking at the first speech of Konrad Adenauer in the corpus, we would expect `s_attributes()` to return Konrad Adenauer as the speaker and 1949-09-07 as the date. After excluding all interjections with `subset()` and retrieving the speeches from the protocol of 1949-09-07, this should hold regardless of which of the two input objects we use.

**Note:** The order of operations matters in the following examples: `subset()` on `p_type` is applied first, `as.speeches()` second.

While the retrieval of s-attributes is correct for the subcorpus created based on the subset itself ...

```
date_from_subset |>
  subset(p_type == "speech") |>
  as.speeches(
    s_attribute_date = "protocol_date",
    s_attribute_name = "speaker_name",
    gap = 0
  ) |>
  _[["Konrad Adenauer_1949-09-07_1"]] |>
  s_attributes(c("protocol_date", "speaker_name"))
```

... it is incorrect for the subcorpus which is based on the subcorpus bundle:

```
date_from_bundle |>
  subset(p_type == "speech") |>
  as.speeches(
    s_attribute_date = "protocol_date",
    s_attribute_name = "speaker_name",
    gap = 0
  ) |>
  _[["Konrad Adenauer_1949-09-07_1"]] |>
  s_attributes(c("protocol_date", "speaker_name"))
```

Here, `s_attributes()` returns metadata that does not match the selected speech.

### Possible Cause

In a nutshell, I think the issue is essentially threefold:

* the "xml type" is different between the two subcorpus objects
* this determines whether the *struc* or the *cpos* values in the subcorpus objects are used to retrieve structural attributes via `s_attributes()`
* the *struc* value in both subcorpora is the same and points to the paragraph (i.e. the `p_type` region) of the subcorpus. The same value is then also used to retrieve the other structural attributes, which causes the issue described above

In consequence, the retrieval of s-attributes works for the subcorpora above if the type is "nested" and it doesn't work if the type is "flat".

The problem also only occurs if more than one structural attribute is retrieved. It works fine for a single s-attribute because the retrieval then follows a different process.
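The suspected mechanism can be illustrated with a toy model in base R. This is not polmineR's actual implementation: the region tables, the `sub` object and the helpers `lookup_by_struc()` and `lookup_by_cpos()` are hypothetical and only mimic the two retrieval strategies described above, with the struc-based lookup standing in for what presumably happens in the "flat" case.

```r
# Toy region tables for two s-attributes (0-based strucs and corpus
# positions). All values are made up for illustration.
speaker_name <- data.frame(
  struc = 0:2,
  cpos_left = c(0, 6, 10),
  cpos_right = c(5, 9, 14),
  value = c("Speaker A", "Speaker B", "Speaker C")
)
p_type <- data.frame(
  struc = 0:4,
  cpos_left = c(0, 2, 4, 6, 10),
  cpos_right = c(1, 3, 5, 9, 14),
  value = c("speech", "interjection", "speech", "speech", "speech")
)

# Toy subcorpus with one region, obtained by subsetting on p_type:
# its struc refers to the p_type table, not to speaker_name.
sub <- list(struc = 2, cpos_left = 4, cpos_right = 5)

# Strategy 1: reuse the stored struc for whatever attribute is requested
lookup_by_struc <- function(regions, struc) {
  regions$value[match(struc, regions$struc)]
}

# Strategy 2: find the region of the requested attribute that covers cpos
lookup_by_cpos <- function(regions, cpos) {
  regions$value[regions$cpos_left <= cpos & cpos <= regions$cpos_right]
}

lookup_by_struc(p_type, sub$struc)                      # matches the subcorpus
by_struc <- lookup_by_struc(speaker_name, sub$struc)    # struc 2 of the wrong table
by_cpos <- lookup_by_cpos(speaker_name, sub$cpos_left)  # region covering cpos 4

by_struc
by_cpos
```

In this toy setup, the struc is only meaningful for the attribute the subcorpus was built on. Reusing it for `speaker_name` silently returns the value of an unrelated region, whereas resolving via the corpus position finds the region that actually contains the tokens.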

### Discussion

I am not sure what the underlying issue actually is. It could be either that the xml type is different between the two objects or it could be that the same struc value is used for all structural attributes in `s_attributes()`.
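The first of these two candidates can be inspected directly by reading the xml slot of both objects. A short sketch, assuming `date_from_subset` and `date_from_bundle` from the Data section exist in the session:

```r
library(polmineR)

# xml type recorded in each subcorpus object
date_from_subset@xml
date_from_bundle@xml

# FALSE would confirm that the two objects carry different xml types
identical(date_from_subset@xml, date_from_bundle@xml)
```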

With the first issue in mind, I have seen that it is possible to explicitly state the xml type in `split()`. This should also address the issue. However, this is difficult to discover from a user perspective because the consequences of the xml type are not obvious.
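A sketch of what such a call might look like. The argument name `xml` and the value `"nested"` are assumptions here and should be checked against the documentation of `split()` in the installed polmineR version; the object name `date_from_bundle_nested` is only illustrative.

```r
library(polmineR)

# Assumption: the corpus method of split() accepts an `xml` argument
date_from_bundle_nested <- corpus("GERMAPARL2") |>
  split(s_attribute = "protocol_date", xml = "nested") |>
  _[[1]]

# Check which xml type the resulting subcorpus reports
date_from_bundle_nested@xml
```

If the slot reports the requested type, the `s_attributes()` call from the example above can be repeated on this object to see whether the returned metadata is then correct.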

If the main issue is the assignment of xml types, it is worth noting that when `split()` is applied to a subcorpus (instead of the entire corpus), the xml slot of the new subcorpus objects in the bundle is filled with the value stored in the xml slot of the input subcorpus. The same approach could probably be adopted in `split()` for corpus objects because these also include an xml slot.

However, note that apparently some additional consideration is in order here:

https://github.com/PolMine/polmineR/blob/650c75f593253f9aecd5033da195ae067561ff9c/R/split.R#L205
