Allow to pass pre-tokenized Texts to HeidelTime #89

@narnold-cl

Hello,

Thank you very much for your work. We are using HeidelTime in a dynamic setting and have run into several problems. We will list them here as issues, together with suggested design changes that should address them. Most should be straightforward to implement for someone familiar with the project.

Is this software still under active development? If not, would you mind translating these high-level proposals into more concrete terms and pointing out which parts of the implementation would need to change?

Standalone's dependencies

Regarding the standalone version: as far as I understand, HeidelTime needs tokenized text to work, but it does not accept pretokenized text as input. Instead, it contains hard-coded dependencies on external taggers (for tokenization as well as POS tagging), which need to be installed separately.

This has several disadvantages:

  • Tokenization gets out of sync unless you use the exact same tokenizer (and even then you have to run the tokenizer twice).
  • The tokens used internally are lost, because the TimeML version in use does not support explicit token tags.
  • The dependencies are hard-coded: you either use those specific tokenizers/taggers or none at all.
  • It is not actually standalone.
  • Generating the TimeML for a single text file currently involves loading a large language model for tokenization/POS tagging, and tagging another file repeats the whole procedure.

Especially in dynamic contexts, this introduces a huge cost that could easily be avoided.

Parsing tokenized text is quite simple: it could be supplied in a "one token per line" format, or in something similar such as CoNLL. Supporting already POS-tagged text in the same way should not be much harder, and it would remove the hard-coded external dependencies entirely without necessarily reducing performance.
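To illustrate how little parsing is involved, here is a minimal sketch in Python (not HeidelTime code; the function name `parse_pretokenized` and its parameters are hypothetical). It assumes blank lines separate sentences, columns are tab-separated, and, in the columnar case, lines starting with `#` are comments, as in CoNLL-U:

```python
from typing import List, Optional, Tuple

# A token is its surface form plus an optional POS tag.
Token = Tuple[str, Optional[str]]


def parse_pretokenized(
    text: str,
    form_col: int = 0,
    pos_col: Optional[int] = None,
) -> List[List[Token]]:
    """Parse one-token-per-line or tab-separated columnar input.

    With the defaults, every non-empty line is a single token.
    For CoNLL-like input, pass the column indices of the word form
    and the POS tag (hypothetical parameters for this sketch).
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if pos_col is not None and line.startswith("#"):
            # Comment line in columnar input (assumed convention).
            continue
        cols = line.split("\t")
        pos = cols[pos_col] if pos_col is not None else None
        current.append((cols[form_col], pos))
    if current:
        sentences.append(current)
    return sentences
```

Called with the defaults it handles the "one token per line" case; `parse_pretokenized(text, form_col=1, pos_col=2)` would read a layout whose columns are index, form, and POS tag.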

Solution:

  • Provide a way to parse pretokenized text instead of having HeidelTime invoke an external tokenizer itself.
  • Add a CLI option to define the input data format: raw, pretokenized, or POS-tagged (CoNLL).
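The proposed CLI option could look roughly like the following Python sketch. This is only an illustration of the idea, not an existing HeidelTime interface: the flag name `--input-format`, its values, and the `build_parser` helper are all assumptions made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical input-format switch."""
    parser = argparse.ArgumentParser(
        description="Sketch of a tagger CLI with selectable input format."
    )
    parser.add_argument("document", help="path to the input file")
    parser.add_argument(
        "--input-format",
        choices=["raw", "pretokenized", "conll"],
        default="raw",  # keep the current behaviour unless told otherwise
        help="raw text, one token per line, or POS-tagged CoNLL",
    )
    return parser
```

Defaulting to `raw` would keep existing invocations working unchanged, while `--input-format pretokenized` or `--input-format conll` would let the tool skip the external tokenizer or tagger.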
