[Bug]: Exception raised when creating Sentence from text with apostrophes when using SegtokTokenizer #3594

dropther · 2025-01-06T13:33:04Z

Describe the bug

A "ValueError: substring not found" exception is raised when creating a Sentence from the text "John Oʼneill’s construction site".

The issue originates in SegtokTokenizer.tokenize("John Oʼneill’s construction site"), which returns ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']. This does not seem correct: the list contains 'Oʼneill' as a whole token and then 'O' and 'ʼneill’s' again as separate tokens, so part of the text is covered twice.

To Reproduce

from flair.data import Sentence

text = "John Oʼneill’s construction site"
sentence = Sentence(text)

Expected behavior

Creating a sentence object successfully from the text.

Logs and Stack traces

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 4
      1 from flair.data import Sentence
      3 text = "John Oʼneill’s construction site"
----> 4 sentence = Sentence(text)

File ~/.../.venv/lib/python3.12/site-packages/flair/data.py:868, in Sentence.__init__(self, text, use_tokenizer, language_code, start_position)
    866 previous_token: Optional[Token] = None
    867 for word in words:
--> 868     word_start_position: int = text.index(word, current_offset)
    869     delta_offset: int = word_start_position - current_offset
    871     token: Token = Token(text=word, start_position=word_start_position)

ValueError: substring not found
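
The failure can be reproduced without flair. The sketch below mimics the lookup shown in the traceback, using the token list reported for SegtokTokenizer in the bug description. The helper name align_tokens is hypothetical, and the way current_offset advances (to the end of each matched word) is an assumption, since the traceback does not show that line.

```python
text = "John Oʼneill’s construction site"
# Token list as reported for SegtokTokenizer.tokenize(text)
words = ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']


def align_tokens(text, words):
    """Hypothetical helper: locate each word at or after the previous match."""
    current_offset = 0
    positions = []
    for word in words:
        # Same lookup as the line marked in the traceback (data.py:868)
        start = text.index(word, current_offset)
        positions.append((word, start))
        # Assumption: the offset moves to the end of the matched word
        current_offset = start + len(word)
    return positions


try:
    align_tokens(text, words)
except ValueError as err:
    print(f"ValueError: {err}")
```

'John' and 'Oʼneill' are found at offsets 0 and 5. The next token, 'O', is then searched from offset 12, where only "’s construction site" remains, so text.index raises.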

Environment

Versions:

Flair: 0.15.0
PyTorch: 2.5.1
Transformers: 4.40.2
GPU: False


alanakbik · 2025-01-07T12:34:04Z

Hello @dropther, thanks for reporting this. The error seems to be caused by one of the functions in segtok, the library we use for tokenization:

from segtok.tokenizer import word_tokenizer, split_contractions

text = "John Oʼneill’s construction site"

# this part is ok
tokens = word_tokenizer(text)
print(tokens)

# the error happens here
after_split = split_contractions(tokens)
print(after_split)

If you replace ʼ with ' it works. So a quick workaround for now would be to make this replacement on your text.
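
A minimal sketch of that workaround, assuming the only problematic character is the ʼ variant from the report. The helper name normalize_apostrophes is hypothetical and not part of flair or segtok.

```python
def normalize_apostrophes(text: str) -> str:
    """Hypothetical helper: replace the apostrophe variant from the report with ASCII."""
    return text.replace("ʼ", "'")


text = "John Oʼneill’s construction site"
cleaned = normalize_apostrophes(text)
print(cleaned)

# One character is swapped for one character, so the length is unchanged
assert len(cleaned) == len(text)

# With flair installed, the cleaned string is what would be passed on:
# from flair.data import Sentence
# sentence = Sentence(cleaned)
```

Because the replacement keeps the string length, token start positions computed on the cleaned text still line up with the original text.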

heukirne · 2025-01-22T19:59:38Z

For me, this happens when \r appears near the beginning of the sentence:

from flair.data import Sentence

s = 'O-\rBEG, sopros\rABD.'
sent = Sentence(s)

Result: ValueError: substring not found

s = 'O-\rBEG, sopros\rABD.'.replace('\r','')
sent = Sentence(s)

Result: OK
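
The same cleanup can be wrapped in a small helper. This is only a sketch: the name drop_carriage_returns and its replacement parameter are hypothetical. Substituting a space instead of removing the character keeps the string length, but whether the tokenizer then handles the text correctly is an assumption that has not been tested here.

```python
def drop_carriage_returns(text: str, replacement: str = "") -> str:
    """Hypothetical helper: remove (or substitute) every carriage return."""
    return text.replace("\r", replacement)


s = 'O-\rBEG, sopros\rABD.'

# Same result as .replace('\r', '') in the comment above
print(repr(drop_carriage_returns(s)))

# Variant that keeps the original length by substituting a space
print(repr(drop_carriage_returns(s, " ")))
```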

dropther added the bug label on Jan 6, 2025
