Merge pull request #5 from NebularNerd/dev

Major refactor
NebularNerd · Feb 2, 2025 · 58fdb20
2 parents 65cb7b8 + dc44ff7
commit 58fdb20

Showing 5 changed files with 320 additions and 169 deletions.
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,2 @@
+[flake8]
+max-line-length = 120
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 # subtotxt
-Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt subtitle line numbers. 
+Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt/.vtt subtitle line numbers. 
 This was a quick project thrown together for my girlfriend, she's still learning English and wanted to be able to read subtitles more like a transcript for some trickier language issues (and to understand the jokes in Friends by discussing them with me).  
 
 With a spot of feature creep and some encoding detection needs, it evolved into being able to detect character encoding, along with being able to understand both .srt and .vtt formats to save some pre-processing work.
@@ -10,7 +10,7 @@ or
 ```python C:\Python\subtotxt.py -f subtitle.vtt```  
 The script will check which format the subtitle file is (in case of incorrect file extensions), detect the character encoding used, then write out a .txt file with the same name as your input. If the output file already exists it will ask for permission to delete it and create a new one.
 ## Advanced Usage:
-The script has six more arguments you can parse:  
+The script has more advanced arguments you can pass:  
 - *--utf8* or *-8*  
 Forces the output file to use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.  
 - *--pause* or *-p*  
@@ -20,26 +20,27 @@ Prints the output to the console while writing to the file, may help with debugg
 - *--copy* or *-c*  
 Copies input to output without change, appends *-copy* to filename *e.g.: subtitle-copy.srt*, handy to use with *--utf8* to quickly change encoding. Might be useful if your video player app cannot understand your original subtitle file encoding.
 - *--overwrite* or *-o*  
-Skips asking ```Output file already exists, delete and make a new one? [y/n]``` and simply deletes the existing output file to create a new one. Ideal for batch processing.
+Skips asking `Output file already exists, delete and make a new one? [y/n]` and simply deletes the existing output file to create a new one. Ideal for batch processing.
 - *--oneliners* or *-1*   
 Writes all sentences in one line, even if the original file divides some sentences into many lines or subtitles.
 - *--help* or *-h*   
 Shows above information.
 ## Required External Modules:  
 - [Send2Trash](https://pypi.org/project/Send2Trash/) Python module to safely delete the old output file on both Win and \*nix based systems.
-- ~~[cchardet](https://pypi.org/project/cchardet/) Python module to detect your subtitle file encoding~~ (Removed for v2.0 release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).  
-- [charset_normalizer](https://github.com/Ousret/charset_normalizer) Python module to detect your subtitle file encoding (v2.0+ supports Python 3.9.x and 3.10.x).   
+- ~~[cchardet](https://pypi.org/project/cchardet/) Python module to detect your subtitle file encoding~~ (Removed for v2.0+ release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).  
+- [charset_normalizer](https://github.com/Ousret/charset_normalizer) Python module to detect your subtitle file encoding (used in v2.0 and the YYYY-MM-DD versions; supports Python 3.9.x and above).  
 
-If your system does not these installed, it will auto install them on first use.  
+If your system does not have these installed, the script will auto-install them on first use (or after you install a new version of Python later). If you prefer, you can install them manually or by using `requirements.txt`.
 ## Features:
 - Fast (aside from initial missing modules install on slow net connections)
-- Input files character encoding formats are autodetected (if supported by [cchardet](https://pypi.org/project/cchardet/) [v1.0] or [charset_normalizer](https://github.com/Ousret/charset_normalizer) [v2.0+])  
+- Input file character encodings are autodetected (if supported by [cchardet](https://pypi.org/project/cchardet/) [v1.0] or [charset_normalizer](https://github.com/Ousret/charset_normalizer) [v2.0+]). Detection should be fine for most languages. For Chinese and neighbouring languages it can be tricky: a subtitle may contain valid characters for Mandarin or Cantonese (or other dialects), and the detected encoding may be the wrong one. This can result in some wonky detection, but it should not affect the overall output.
 - Output files are written in the same encoding as the input or can be forced to UTF-8
 - Should be cross platform friendly thanks to PathLib and Send2Trash
 - Handles UNC style ```\\myserver\myshare\mysub.srt``` paths thanks to PathLib
 - Handles SRT to TXT or WEBVTT to TXT
 - Handles multi line subtitles and subtitle lines with just numbers (does not confuse them with SRT line numbers)
-- WEBVTT: Removes 'WEBVTT', 'Kind: xxxx', 'Language: xxx' headers and Timestamps from output
+- Strips formatting tags and the rogue `{\an8}` tags sometimes found in poorly converted subtitles
+- WEBVTT: Removes the 'WEBVTT' header, metadata, notes, styles and timestamps from output
 - SRT: Removes subtitle line #'s and Timestamps, will not work if first subtitle is not 1 or if duplicated line numbers are present (rare cases but possible), use [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to renumber lines for now if this happens. 
 ## Examples:
 WEBVTT Input:
@@ -154,6 +155,5 @@ Output:
 - Possibly handle more formats (.ssa Sub Station Alpha would be the other major one I could think of), for now you can use something like [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
 - GUI option for simple drag and drop usage.
 - Figure out a checking method for misnumbered or duplicate numbered SRT line numbers.
-- Handle stripping out SRT formatting tags for bold, italic etc...
 ## License:
 Released as CC0, use it how you wish. If you do use it elsewhere, please be awesome and tag me as the original author. 🙂
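
The README feature list in the diff above now mentions stripping formatting tags and rogue `{\an8}` tags. The following is only a minimal sketch of how such stripping could be done with regular expressions; it is not taken from `subtotxt.py`. The function name `strip_tags` and both patterns are assumptions made for illustration, and the real script may handle more cases, such as WEBVTT cue timestamp tags.

```python
import re

# Hypothetical patterns for illustration only; not copied from subtotxt.py.
# Matches SSA/ASS-style override blocks such as {\an8} or {\i1}.
OVERRIDE_TAG = re.compile(r"\{\\[^{}]*\}")
# Matches HTML-like tags such as <i>, </b> or <font color="#fff">.
# Requiring a letter right after "<" leaves text like "1 < 2" untouched.
MARKUP_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")


def strip_tags(line: str) -> str:
    """Return the subtitle line with override blocks and markup tags removed."""
    line = OVERRIDE_TAG.sub("", line)
    line = MARKUP_TAG.sub("", line)
    return line.strip()


if __name__ == "__main__":
    print(strip_tags(r"{\an8}<i>Hello</i> <b>world</b>"))
```

Applying the override pattern first means a line that starts with `{\an8}` followed by an italic tag is cleaned in a single pass over each pattern.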
diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,24 @@
+[tool.black]
+line-length = 120
+target-version = [
+  'py38',
+  'py39',
+  'py310',
+  'py311',
+  'py312',
+  'py313',
+]
+exclude = '''
+/(
+    \.eggs
+  | \.git
+  | \.idea
+  | \.pytest_cache
+  | \.github
+  | _build
+  | build
+  | dist
+  | venv
+  | test/resources
+)/
+'''
diff --git a/requirements.txt b/requirements.txt