Creating a speech bundle without interjections #281

@ChristophLeonhardt

Subsetting a speech bundle returns a subcorpus bundle with unexpected subcorpora: the initial separation into speeches is not kept.

Hence the question: What is the most efficient way to create a speech bundle without interjections?

Scenario I: Splitting into speeches, then subsetting by paragraph type

Using GERMAPARL2 to create a speech bundle seems to work fine. The output is a subcorpus bundle with about 450 thousand subcorpora.

```r
library(polmineR)

all_speeches <- corpus("GERMAPARL2") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date")
```

Assumption: I want to omit all interjections from these speeches. I think the logical step would be a subset.

```r
all_speeches_min <- all_speeches |>
  subset(p_type == "speech")
```

Expected output: A subcorpus bundle with the same subcorpora (assuming that there are no speeches which only contain interjections) but without paragraphs which are not of type "speech".

Observed output: A subcorpus bundle with about 4400 subcorpora.

It seems that there is now one subcorpus per unique speaker rather than one per speech.
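To make the change easier to inspect, the two bundles can be compared by size and by name. The following is only a minimal sketch: `compare_bundles()` is a hypothetical helper, not part of polmineR. It relies solely on `length()` and `names()`, which are available for bundles and for plain lists alike.

```r
# Hypothetical helper (not part of polmineR): summarise how a bundle
# changed after an operation such as subset().
compare_bundles <- function(before, after) {
  list(
    n_before   = length(before),   # number of subcorpora before
    n_after    = length(after),    # number of subcorpora after
    # names present before but missing afterwards
    names_lost = length(setdiff(names(before), names(after))),
    # names that only appear afterwards
    names_new  = length(setdiff(names(after), names(before)))
  )
}

# Usage with the objects from Scenario I (requires GERMAPARL2):
# compare_bundles(all_speeches, all_speeches_min)
# head(names(all_speeches_min))
```

If the names of the subsetted bundle turn out to be plain speaker names without the date and speech number, that would support the guess that the grouping falls back to one subcorpus per speaker.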

Scenario II: Subsetting by paragraph type, then splitting into speeches

In contrast, this seems to work.

```r
all_speeches_2 <- corpus("GERMAPARL2") |>
  subset(p_type == "speech") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date")
```

Discussion

Aside from the second approach being very slow, it is not obvious to me why the first approach should not work. Is the first scenario supposed to work at all? If it is, there might be a bug. If it is not, some additional documentation would be useful.
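To put a number on "very slow", the runtime of each scenario could be measured with base R's `system.time()`. This is a rough sketch only: `time_elapsed()` is a hypothetical wrapper introduced for illustration, and the commented usage assumes GERMAPARL2 is installed.

```r
# Hypothetical helper: run a zero-argument function once and return
# both its result and the elapsed wall-clock time in seconds.
time_elapsed <- function(f) {
  elapsed <- system.time(result <- f())[["elapsed"]]
  list(result = result, elapsed = elapsed)
}

# Usage for Scenario II (requires polmineR and GERMAPARL2):
# scenario_2 <- time_elapsed(function() {
#   corpus("GERMAPARL2") |>
#     subset(p_type == "speech") |>
#     as.speeches(s_attribute_name = "speaker_name",
#                 s_attribute_date = "protocol_date")
# })
# scenario_2$elapsed
```

A single run is only a coarse indication, since caching and machine load can affect the timing.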

Additional Remarks

The as.speeches() method also has a subset argument. However, as noted in the documentation, it currently only works for speaker names (speaker) and dates (date), not for other structural attributes.

This was tested using polmineR 0.8.9.9001.
