Description
The ISO-TimeML version of the TimeML standard offers (at least) the following benefits:
- Standoff annotations (Chapter 3.3; see the compromise below)
- Preserved tokenization
Read about it here:
https://lexitron.nectec.or.th/public/LREC-2010_Malta/pdf/55_Paper.pdf
If supporting the complete standard is too much work, it would still be nice to have standoff annotations. We currently calculate the character offsets manually and fuzzy-match them to the token and sentence boundaries detected by our own preprocessing pipeline.
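To illustrate the matching step, here is a minimal sketch that maps a character span onto token indices by overlap. It is not our actual pipeline: the function name `tokens_for_span` and the `(start, end)` offset format are assumptions made for illustration, and real fuzzy matching would need to tolerate whitespace and normalization differences as well.

```python
def tokens_for_span(token_offsets, start, end):
    """Return indices of tokens overlapping the half-open range [start, end).

    token_offsets is a list of (start, end) character offsets, one pair per
    token, as produced by an external tokenizer (hypothetical format).
    """
    return [
        i
        for i, (t_start, t_end) in enumerate(token_offsets)
        if t_start < end and start < t_end
    ]


# Offsets for "Today I feel great." as a whitespace/punctuation tokenizer
# might report them: Today, I, feel, great, "."
tokens = [(0, 5), (6, 7), (8, 12), (13, 18), (18, 19)]
print(tokens_for_span(tokens, 0, 5))
```

Overlap matching keeps a span usable even when its boundaries fall inside a token, which is the typical failure case when two tokenizers disagree.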
Compromise: adding standoff information to the existing inline TimeML annotations
A simple fix for this specific problem would be to (optionally) add the character positions to the tagged spans, like so:
# input text:
"Today I feel great."
# currently generated TimeML output:
'<?xml version="1.0"?><!DOCTYPE TimeML SYSTEM "TimeML.dtd"><TimeML>
<TIMEX3 tid="t1" type="DATE" value="2021-11-16">Today</TIMEX3> I feel great.
</TimeML>'
# Proposed additional tag attributes (orig_start_char, orig_end_char):
<TIMEX3 tid="t1" type="DATE" value="2021-11-16" orig_start_char="0" orig_end_char="5">Today</TIMEX3>
This would record that the original span tagged by the TIMEX3 with tid t1 runs from character 0 (inclusive) to character 5 (exclusive).
Again, this information is necessary to synchronize the tokenization that HeidelTime uses internally, and then discards, with your own tokenization.
The information for those additional attributes should be easily accessible at runtime.
We've already implemented a first draft of a parsing algorithm that incrementally reconstructs those character-based span indices after the fact, but it feels like a lot of duplicate work to rebuild information that was already available at HeidelTime's runtime.
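For reference, this kind of after-the-fact reconstruction can be sketched roughly as follows. This is an illustrative sketch, not our draft implementation: the function name `recover_spans` is hypothetical, and it assumes that TIMEX3 tags are not nested, that the input is the text between `<TimeML>` and `</TimeML>`, and that special characters in the output are escaped as standard XML entities.

```python
import re
from html import unescape

# Matches one non-nested TIMEX3 element: group 1 = attributes, group 2 = text.
TIMEX_RE = re.compile(r"<TIMEX3\b([^>]*)>(.*?)</TIMEX3>", re.DOTALL)
TID_RE = re.compile(r'\btid="([^"]*)"')


def recover_spans(timeml_body):
    """Return (plain_text, spans); spans are (tid, start, end) tuples.

    Offsets are relative to the reconstructed plain text, with an
    inclusive start and an exclusive end.
    """
    plain_parts = []
    spans = []
    offset = 0  # plain-text characters emitted so far
    cursor = 0  # current position in the tagged string
    for match in TIMEX_RE.finditer(timeml_body):
        # Untagged text between the previous tag and this one.
        before = unescape(timeml_body[cursor:match.start()])
        plain_parts.append(before)
        offset += len(before)

        # Text covered by the TIMEX3 tag itself.
        inner = unescape(match.group(2))
        tid_match = TID_RE.search(match.group(1))
        tid = tid_match.group(1) if tid_match else None
        spans.append((tid, offset, offset + len(inner)))
        plain_parts.append(inner)
        offset += len(inner)
        cursor = match.end()

    # Trailing untagged text after the last tag.
    plain_parts.append(unescape(timeml_body[cursor:]))
    return "".join(plain_parts), spans


body = '<TIMEX3 tid="t1" type="DATE" value="2021-11-16">Today</TIMEX3> I feel great.'
text, spans = recover_spans(body)
print(text, spans)
```

Even this small version has to guess at escaping and whitespace handling, which is exactly the information that would not need guessing if the offsets were emitted directly.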