Summary
Depending on how you create subcorpus objects, the seemingly identical objects might be characterized by different "xml types" (i.e. nested or flat). This might result in incorrect behavior, in particular in the wrong assignment of structural attributes using s_attributes().
Examples
For this example, I use "GERMAPARL2". I assume that the nested nature of the corpus might be relevant here. However, the issue of inconsistent xml assignments can also be observed for GermaParlMini, for example.
Data
Depending on the broader analysis workflow, there are different ways to create a subcorpus. You could either split a corpus into a number of subcorpora by creating a subcorpus bundle via split() or you could directly create a subcorpus using subset().
The assumption would be that if you retrieve the same subcorpus from the subcorpus bundle, it should be identical to the subcorpus created with subset().
So, we can create the following two objects:
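library(polmineR)

# Subcorpus created directly with subset()
date_from_subset <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07")

# Same date, taken as the first element of a subcorpus bundle
date_from_bundle <- corpus("GERMAPARL2") |>
  split(s_attribute = "protocol_date") |>
  _[[1]]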
Both objects are identical in size and contain the same speakers, etc.
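A minimal sketch of how this equivalence can be checked, assuming GERMAPARL2 is installed and that the first element of the bundle is the protocol of 1949-09-07 (as in the code above). The names `same_size` and `same_speakers` are only illustrative.

```r
library(polmineR)

# Recreate the two objects from above (the `_[[1]]` placeholder syntax needs R >= 4.3)
date_from_subset <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07")
date_from_bundle <- corpus("GERMAPARL2") |>
  split(s_attribute = "protocol_date") |>
  _[[1]]

# Compare the number of tokens in both subcorpora
same_size <- size(date_from_subset) == size(date_from_bundle)

# Compare the sets of speakers (single s-attribute, so the retrieval issue does not apply)
same_speakers <- setequal(
  s_attributes(date_from_subset, "speaker_name"),
  s_attributes(date_from_bundle, "speaker_name")
)

print(c(same_size = same_size, same_speakers = same_speakers))
```

According to the observation above, both checks should come out as TRUE, which is what makes the diverging behaviour described next hard to spot.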
Issue: Wrongly returned structural attributes
Looking at the first speech of Konrad Adenauer in the corpus, we would expect the speaker to be Konrad Adenauer and the date to be 1949-09-07. After excluding all interjections with subset() and retrieving the speeches from the protocol of 1949-09-07, this should hold regardless of which of the two input objects we use.
Note: The order of the calls matters here: subset() with p_type comes first, then as.speeches().
The retrieval of s-attributes is correct for the subcorpus created directly with subset() ...
date_from_subset |>
  subset(p_type == "speech") |>
  as.speeches(
    s_attribute_date = "protocol_date",
    s_attribute_name = "speaker_name",
    gap = 0
  ) |>
  _[["Konrad Adenauer_1949-09-07_1"]] |>
  s_attributes(c("protocol_date", "speaker_name"))
... it is incorrect for the subcorpus which is based on the subcorpus bundle:
date_from_bundle |>
  subset(p_type == "speech") |>
  as.speeches(
    s_attribute_date = "protocol_date",
    s_attribute_name = "speaker_name",
    gap = 0
  ) |>
  _[["Konrad Adenauer_1949-09-07_1"]] |>
  s_attributes(c("protocol_date", "speaker_name"))
This returns the wrong metadata.
Possible Cause
In a nutshell, I think the issue is essentially threefold:
- the "xml type" is different between the two subcorpus objects
- this determines whether the struc or the cpos values in the subcorpus objects are used to retrieve structural attributes via s_attributes()
- the struc value in both subcorpora is the same and points to the paragraph (or p_type) of the subcorpus. The same value is then also used to retrieve other structural attributes, which causes the issues above
In consequence, the retrieval of s-attributes works for the subcorpora above if the type is "nested" and it doesn't work if the type is "flat".
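To see which xml type each object carries, the xml slot can be inspected directly. This is a minimal sketch that assumes GERMAPARL2 is installed; `xml_types` is just an illustrative name, and no particular output is asserted here.

```r
library(polmineR)

# Recreate the two objects from above (the `_[[1]]` placeholder syntax needs R >= 4.3)
date_from_subset <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07")
date_from_bundle <- corpus("GERMAPARL2") |>
  split(s_attribute = "protocol_date") |>
  _[[1]]

# Read the xml slot of both subcorpus objects side by side
xml_types <- c(
  subset = date_from_subset@xml,
  bundle = date_from_bundle@xml
)
print(xml_types)
```

If the two values differ, that would be consistent with the first point in the list above.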
This also only occurs if more than one structural attribute is to be retrieved. It works fine for a single s-attribute because then, the process is different.
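The suspected mechanism can be illustrated with a toy model in base R. This is not polmineR's internal code: the region tables, the speaker labels and the helper functions `value_by_cpos()` and `value_by_struc()` are all made up for illustration. The sketch only shows why reusing the struc of one s-attribute (here p_type) as an index into another s-attribute (here speaker_name) can return the wrong value, while a lookup via corpus positions does not.

```r
# Toy region tables (hypothetical data): each row is one region (struc)
# with its left/right corpus positions (cpos) and its value.
speaker_name <- data.frame(
  struc = 0:2,
  cpos_left = c(0L, 100L, 200L),
  cpos_right = c(99L, 199L, 299L),
  value = c("Speaker A", "Speaker B", "Speaker C"),
  stringsAsFactors = FALSE
)
p_type <- data.frame(
  struc = 0:4,
  cpos_left = c(0L, 50L, 100L, 150L, 200L),
  cpos_right = c(49L, 99L, 149L, 199L, 299L),
  value = c("speech", "speech", "speech", "interjection", "speech"),
  stringsAsFactors = FALSE
)

# The toy subcorpus holds a single p_type region: struc 2, cpos 100 to 149
region <- p_type[p_type$struc == 2L, ]

# Lookup via corpus position: find the region of the other s-attribute that contains the cpos
value_by_cpos <- function(s_attr, cpos) {
  s_attr$value[s_attr$cpos_left <= cpos & cpos <= s_attr$cpos_right]
}

# Lookup via struc: reuse the struc number as if it belonged to the other s-attribute
value_by_struc <- function(s_attr, struc) {
  s_attr$value[s_attr$struc == struc]
}

by_cpos <- value_by_cpos(speaker_name, region$cpos_left)  # "Speaker B"
by_struc <- value_by_struc(speaker_name, region$struc)    # "Speaker C"
print(c(by_cpos = by_cpos, by_struc = by_struc))
```

In this toy setup, the struc-based lookup returns a different speaker because the struc numbering of p_type and speaker_name is not aligned. Whether this is exactly what happens inside s_attributes() for "flat" subcorpora is an assumption derived from the observations above.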
Discussion
I am not sure what the underlying issue actually is. It could be either that the xml type is different between the two objects or it could be that the same struc value is used for all structural attributes in s_attributes().
With the first issue in mind, I have seen that it is possible to explicitly state the xml type in split(). This should also address the issue. But this is rather difficult to see from a user perspective as the consequences of this are not that obvious.
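A sketch of this workaround could look as follows. It assumes that the relevant argument of split() is called `xml` and accepts the value "nested"; please check the split() documentation of the installed polmineR version before relying on it. The names `date_from_bundle_nested` and `meta` are illustrative.

```r
library(polmineR)

# ASSUMPTION: split() for corpus objects takes an argument `xml` with value "nested"
date_from_bundle_nested <- corpus("GERMAPARL2") |>
  split(s_attribute = "protocol_date", xml = "nested") |>
  _[[1]]

# Check which xml type ended up in the resulting subcorpus
print(date_from_bundle_nested@xml)

# Repeat the retrieval that returned wrong metadata for the bundle-based object
meta <- date_from_bundle_nested |>
  subset(p_type == "speech") |>
  as.speeches(
    s_attribute_date = "protocol_date",
    s_attribute_name = "speaker_name",
    gap = 0
  ) |>
  _[["Konrad Adenauer_1949-09-07_1"]] |>
  s_attributes(c("protocol_date", "speaker_name"))
print(meta)
```

If the xml type is indeed the cause, the metadata returned here should match the result for the subcorpus created with subset().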
If the assignment of xml types is the main issue, it is worth noting that when split() is applied to a subcorpus (instead of the entire corpus), the xml slot of the new subcorpus objects in the subcorpus bundle is filled with the value stored in the xml slot of the input subcorpus. The same approach could probably be adopted in split() for corpus objects, because these also have an xml slot.
However, note that apparently some additional consideration is in order here:
Line 205 in 650c75f:
  y@xml = x@xml # to reconsider