Captions whose text begins with Line Separator character are parsed as blank string #87

Open
ontl opened this issue Jun 17, 2021 · 1 comment

ontl commented Jun 17, 2021

I occasionally see SRTs in which one or two captions begin with the Line Separator character, U+2028. Those captions are incorrectly parsed as blank.

I believe the character originates in Word and is carried over when a transcript is copy-pasted into YouTube to use YouTube's transcript auto-timing function.

This character seems to act as a normal line break when in the middle or end of a caption; the issue only arises when it is the first character of the caption.
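One possible explanation, not confirmed against the PySRT source: Python's `str.splitlines()` treats U+2028 as a line boundary, so text that starts with it yields an empty first line, which a parser could read as a blank caption. The snippet below only demonstrates that standard-library behavior; the variable names are illustrative.

```python
# U+2028 (LINE SEPARATOR) is one of the boundaries str.splitlines() recognizes.
LINE_SEPARATOR = "\u2028"

leading = LINE_SEPARATOR + "This caption starts with U+2028"
middle = "This caption has a U+2028 here:" + LINE_SEPARATOR + "which does not cause issues."

# A leading separator produces an empty first element.
leading_lines = leading.splitlines()
# A separator in the middle simply splits the text into two non-empty lines.
middle_lines = middle.splitlines()

print(leading_lines)
print(middle_lines)
```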

I think the parser should ignore this character.

VLC, for the record, ignores it and displays the caption normally.

Gotchas:
It may make sense to pre-process the file, replacing U+2028 with a more compatible line break such as \n. We should be careful, though, not to inadvertently trigger the blank-caption behavior described in Issue 71 by having a caption start with \n.
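A minimal sketch of that pre-processing idea, assuming the whole file is available as one string. The function name `normalize_line_separators` is hypothetical and not part of PySRT. It drops any U+2028 that touches a real line break, so no empty line is created, and only then converts the remaining ones to \n.

```python
import re


def normalize_line_separators(srt_text: str) -> str:
    """Hypothetical helper: make U+2028 safe before parsing an SRT string."""
    # Drop U+2028 directly after a line break (or at the very start),
    # so the caption text does not begin with an empty line.
    text = re.sub(r"(^|\n)\u2028+", r"\1", srt_text)
    # Drop U+2028 directly before a line break (or at the very end),
    # so no blank line is created in the middle of a caption.
    text = re.sub(r"\u2028+(\r?\n|$)", r"\1", text)
    # Whatever remains sits between visible text: use a plain newline.
    return re.sub(r"\u2028+", "\n", text)


sample = (
    "1\n00:00:08,330 --> 00:00:13,653\n"
    "\u2028This caption starts with U+2028\n\n"
    "2\n00:00:13,653 --> 00:00:18,305\n"
    "here:\u2028which is fine\n"
)
normalized = normalize_line_separators(sample)
```

A caption that starts with a real \n is left untouched, so the Issue 71 behavior is neither fixed nor made worse by this step.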

Example SRT that exhibits this problem:

1
00:00:08,330 --> 00:00:13,653

This caption starts with the character
u2028, which causes PySRT to see it as blank.

2
00:00:13,653 --> 00:00:18,305
This caption has a u2028 here:
 which does not cause issues.

3
00:00:18,305 --> 00:00:22,906

This caption starts with a normal line break; VLC
and PySRT show it as blank as per Issue 71.

Output:

  • Caption 1: VLC displays the caption, PySRT parses it as blank
  • Caption 2: VLC and PySRT display the caption
  • Caption 3: VLC and PySRT show the caption as blank

ontl commented Jun 19, 2021

After some poking around, I've had success preprocessing my srt files with .replace('\n\u2028', '\n')
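For reference, the workaround above could be wrapped as follows. This is only a sketch: the function name and the sample text are made up for illustration, and it removes a U+2028 only when it directly follows a line break, leaving mid-caption occurrences alone. The cleaned string could then be handed to the parser, for example through `pysrt.from_string`, assuming the installed version exposes it.

```python
def strip_leading_line_separator(srt_text: str) -> str:
    """Remove a U+2028 that directly follows a line break (the workaround above)."""
    return srt_text.replace("\n\u2028", "\n")


raw = (
    "1\n00:00:08,330 --> 00:00:13,653\n"
    "\u2028This caption starts with U+2028.\n\n"
    "2\n00:00:13,653 --> 00:00:18,305\n"
    "This caption has one here:\u2028mid-caption.\n"
)
cleaned = strip_leading_line_separator(raw)
```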

Will look through the pysrt code and submit a PR if I can find the best place/method to do this. Suggestions welcome.
