A ValueError: substring not found exception is raised when trying to create a Sentence from the text "John Oʼneill’s construction site".
The issue originates from SegtokTokenizer.tokenize("John Oʼneill’s construction site") that returns ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site'], which does not seem correct.
To Reproduce
from flair.data import Sentence

text = "John Oʼneill’s construction site"
sentence = Sentence(text)
Expected behavior
Creating a sentence object successfully from the text.
Logs and Stack traces
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[5], line 4
1 from flair.data import Sentence
3 text = "John Oʼneill’s construction site"
----> 4 sentence = Sentence(text)
File ~/.../.venv/lib/python3.12/site-packages/flair/data.py:868, in Sentence.__init__(self, text, use_tokenizer, language_code, start_position)
866 previous_token: Optional[Token] = None
867 for word in words:
--> 868 word_start_position: int = text.index(word, current_offset)
869 delta_offset: int = word_start_position - current_offset
871 token: Token = Token(text=word, start_position=word_start_position)
ValueError: substring not found
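To illustrate why the token list reported above leads to this exception, here is a minimal sketch that mimics the offset lookup visible in the traceback (`text.index(word, current_offset)`). It does not use Flair itself. The helper name `find_failing_token` is hypothetical, and the way the offset advances after each token is an assumption, since that line is not shown in the traceback.

```python
from typing import List, Optional, Tuple

text = "John Oʼneill’s construction site"
# Token list reported in this issue for SegtokTokenizer.tokenize(text)
tokens = ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']


def find_failing_token(text: str, words: List[str]) -> Optional[Tuple[str, int]]:
    """Return (word, offset) for the first token that cannot be found
    at or after the current offset, or None if every token is found."""
    current_offset = 0
    for word in words:
        try:
            # Same lookup as in the traceback
            start = text.index(word, current_offset)
        except ValueError:
            return word, current_offset
        # Assumption: the search resumes right after the matched token
        current_offset = start + len(word)
    return None


print(find_failing_token(text, tokens))
```

Under that assumption, the duplicated 'O' token is searched for after 'Oʼneill' has already been consumed, and no later 'O' exists in the text, so the lookup fails.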
Screenshots
No response
Additional Context
No response
Environment
Versions:
Flair: 0.15.0
Pytorch: 2.5.1
Transformers: 4.40.2
GPU: False
Hello @dropther thanks for reporting this. It seems the error is caused by one of the functions in segtok, the library we use for tokenization:
from segtok.tokenizer import word_tokenizer, split_contractions

text = "John Oʼneill’s construction site"

# this part is ok
tokens = word_tokenizer(text)
print(tokens)

# the error happens here
after_split = split_contractions(tokens)
print(after_split)
If you replace ʼ with ' it works. So a quick workaround for now would be to make this replacement on your text.
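A minimal sketch of that workaround, applied as a preprocessing step before the text is handed to Flair. The helper name `normalize_apostrophes` is hypothetical and not part of Flair or segtok. It only replaces the ʼ character mentioned above and leaves ’ untouched.

```python
def normalize_apostrophes(text: str) -> str:
    """Replace the ʼ character with a plain ASCII apostrophe."""
    return text.replace("ʼ", "'")


text = "John Oʼneill’s construction site"
cleaned = normalize_apostrophes(text)
print(cleaned)
```

The cleaned string can then be passed to `Sentence(cleaned)`. Because the replacement swaps one character for one character, the string length is unchanged, so character offsets computed on the cleaned text should still line up with the original text.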