[Bug]: Exception raised when creating Sentence from text with apostrophes when using SegtokTokenizer #3594

Open
dropther opened this issue Jan 6, 2025 · 2 comments
Labels
bug Something isn't working

dropther commented Jan 6, 2025

Describe the bug

A ValueError: substring not found exception is raised when trying to create a Sentence from the text "John Oʼneill’s construction site".

The issue originates in SegtokTokenizer.tokenize("John Oʼneill’s construction site"), which returns ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']. This does not seem correct: 'Oʼneill' appears once whole and again as the fragments 'O' and 'ʼneill’s'.

To Reproduce

from flair.data import Sentence

text = "John Oʼneill’s construction site"
sentence = Sentence(text)

Expected behavior

Creating a sentence object successfully from the text.

Logs and Stack traces

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 4
      1 from flair.data import Sentence
      3 text = "John Oʼneill’s construction site"
----> 4 sentence = Sentence(text)

File ~/.../.venv/lib/python3.12/site-packages/flair/data.py:868, in Sentence.__init__(self, text, use_tokenizer, language_code, start_position)
    866 previous_token: Optional[Token] = None
    867 for word in words:
--> 868     word_start_position: int = text.index(word, current_offset)
    869     delta_offset: int = word_start_position - current_offset
    871     token: Token = Token(text=word, start_position=word_start_position)

ValueError: substring not found
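
To illustrate why the duplicated token triggers the error, here is a minimal sketch of the left-to-right alignment that the traceback points at. The helper name align_tokens is hypothetical, and the offset update after each match is an assumption; only the text.index(word, current_offset) call is taken from the traceback. The token list is the one reported above.

```python
def align_tokens(text, words):
    """Locate each token in text, scanning left to right (simplified sketch)."""
    offsets = []
    current_offset = 0
    for word in words:
        # Same call as in the traceback; raises ValueError if word is not found
        start = text.index(word, current_offset)
        offsets.append((word, start))
        # Assumption: the search resumes right after the matched token
        current_offset = start + len(word)
    return offsets


text = "John Oʼneill’s construction site"
# Token list as reported in this issue
reported_tokens = ['John', 'Oʼneill', 'O', 'ʼneill’s', 'construction', 'site']

try:
    align_tokens(text, reported_tokens)
    failed = False
except ValueError:
    # After 'Oʼneill' is consumed, no 'O' remains in the rest of the text
    failed = True

print("alignment failed:", failed)
```

Once 'Oʼneill' has been matched, the search for the extra 'O' starts after it, and the remaining text contains no uppercase 'O', so index() raises.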

Screenshots

No response

Additional Context

No response

Environment

Versions:

Flair: 0.15.0
Pytorch: 2.5.1
Transformers: 4.40.2
GPU: False

dropther added the bug label Jan 6, 2025
alanakbik (Collaborator) commented

Hello @dropther, thanks for reporting this. The error seems to be caused by one of the functions in segtok, the library we use for tokenization:

from segtok.tokenizer import word_tokenizer, split_contractions

text = "John Oʼneill’s construction site"

# this part is ok
tokens = word_tokenizer(text)
print(tokens)

# the error happens here
after_split = split_contractions(tokens)
print(after_split)

If you replace ʼ with ' it works. So a quick workaround for now would be to make this replacement on your text.
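
A minimal sketch of that workaround, assuming the problematic character is U+02BC (MODIFIER LETTER APOSTROPHE). The helper name normalize_apostrophes is hypothetical and not part of flair or segtok.

```python
# Assumption: the character that trips split_contractions is U+02BC
MODIFIER_APOSTROPHE = "\u02bc"


def normalize_apostrophes(text: str) -> str:
    """Replace the modifier-letter apostrophe with a plain ASCII apostrophe."""
    return text.replace(MODIFIER_APOSTROPHE, "'")


text = "John Oʼneill’s construction site"
cleaned = normalize_apostrophes(text)
print(cleaned)

# With flair installed, the cleaned text can then be passed on:
# from flair.data import Sentence
# sentence = Sentence(cleaned)
```

Because the replacement is one character for one character, the character offsets in the cleaned text still line up with the original string.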

heukirne (Contributor) commented Jan 22, 2025

For me, this also happens when \r appears near the beginning of the sentence:

from flair.data import Sentence

s = 'O-\rBEG, sopros\rABD.'
sent = Sentence(s)

Result: ValueError: substring not found

s = 'O-\rBEG, sopros\rABD.'.replace('\r','')
sent = Sentence(s)

Result: OK
