Merge pull request #6 from NebularNerd/dev

Adds .ssa/.ass support and Multifile batch processing
NebularNerd · Feb 3, 2025 · 94980a6
2 parents 58fdb20 + 110be45

Showing 3 changed files with 133 additions and 32 deletions.
diff --git a/.github/workflows/black.yml b/.github/workflows/black.yml
@@ -0,0 +1,11 @@
+name: Black Formatting and Linting
+
+on: [push, pull_request]
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: psf/black@stable
+
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 # subtotxt
-Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt/.vtt subtitle line numbers. 
+Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt, [SubStation Alpha](https://wiki.multimedia.cx/index.php?title=SubStation_Alpha) .ssa/.ass or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt/.vtt subtitle line numbers. 
 This was a quick project thrown together for my girlfriend, she's still learning English and wanted to be able to read subtitles more like a transcript for some trickier language issues (and to understand the jokes in Friends by discussing them with me).  
 
 With a spot of feature creep and some encoding detection needs, it evolved into being able to detect character encoding, along with being able to understand both .srt and .vtt formats to save some pre-processing work.
@@ -11,20 +11,16 @@ or
 The script will check which format the subtitle file is (in case of incorrect file extensions), detect the character encoding used, then write out a .txt file with the same name as your input. If the output file already exists it will ask for permission to delete it and create a new one.
 ## Advanced Usage:
 The script has more advanced arguments you can pass:  
-- *--utf8* or *-8*  
-Forces the output file to use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.  
-- *--pause* or *-p*  
-Pause the script at the sanity check stage to let you check some stats before continuing, handy if the output is not working.  
-- *--screen* or *-s*  
-Prints the output to the console while writing to the file, may help with debugging failed outputs.  
-- *--copy* or *-c*  
-Copies input to output without change, appends *-copy* to filename *e.g.: subtitle-copy.srt*, handy to use with *--utf8* to quickly change encoding. Might be useful if your video player app cannot understand your original subtitle file encoding.
-- *--overwrite* or *-o*  
-Skips asking `Output file already exists, delete and make a new one? [y/n]` and simply deletes the existing output file to create a new one. Ideal for batch processing.
-- *--oneliners* or *-1*   
-Writes all sentences in one line, even if the original file divides some sentences into many lines or subtitles.
-- *--help* or *-h*   
-Shows above information.
+- **--dir** or **-d**: Multiple file mode. Use this **instead** of `-f` and point it at a folder containing your subtitles; every file in that folder with a `.srt`, `.vtt`, `.ssa` or `.ass` extension is processed. The path can be absolute, e.g. `C:\mysubs`, or relative, e.g. `.\`.
+- **--nonames** or **-nn**: For SubStation Alpha, this stops the character name given in the file (if present) from being prepended to the subtitle line. Without it, a line with a character might appear as `Blackadder: Your name is Bob?`. Highly recommended when using `--oneliners` below. For other formats, the script attempts to remove a leading `NAME:` from the subtitle line.
+- **--nosort** or **-ns**: Specifically for SubStation Alpha files. Subtitles in these files can be stored in any order, and the timecode decides when each line appears. I imagine the main reason for this is that you could keep the dialogue in one block and labels for signs, books, etc. in another. By default the script sorts lines by start timecode; use this flag to keep the order found in the file. Most examples I've seen have everything in one large block.
+- **--utf8** or **-8**: Forces the output file to use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.  
+- **--pause** or **-p**: Pause the script at the sanity check stage to let you check some stats before continuing, handy if the output is not working.  
+- **--screen** or **-s**: Prints the output to the console while writing to the file, may help with debugging failed outputs.  
+- **--copy** or **-c**: Copies input to output without change, appends *-copy* to filename *e.g.: subtitle-copy.srt*, handy to use with *--utf8* to quickly change encoding. Might be useful if your video player app cannot understand your original subtitle file encoding.
+- **--overwrite** or **-o**: Skips asking `Output file already exists, delete and make a new one? [y/n]` and simply deletes the existing output file to create a new one. Ideal for batch processing.
+- **--oneliners** or **-1**: Writes all sentences in one line, even if the original file divides some sentences into many lines or subtitles.
+- **--help** or **-h**: Shows above information.
 ## Required External Modules:  
 - [Send2Trash](https://pypi.org/project/Send2Trash/) Python module to safely delete the old output file on both Win and \*nix based systems.
 - ~~[cchardet](https://pypi.org/project/cchardet/) Python module to detect your subtitle file encoding~~ (Removed for v2.0+ release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).  
@@ -33,15 +29,17 @@ Shows above information.
 If your system does not have these installed, it will auto install them on first use (or if you install a new version of Python later). If you prefer, you can install them manually or by using the `requirements.txt`.
 ## Features:
 - Fast (aside from initial missing modules install on slow net connections)
+- Process a single file or point at a folder to process all supported files.
 - Input files character encoding formats are autodetected (if supported by [cchardet](https://pypi.org/project/cchardet/) [v1.0] or [charset_normalizer](https://github.com/Ousret/charset_normalizer) [v2.0+]). For most languages it should be fine, for Chinese and near neighbour languages it can be tricky, a subtitle may contain valid characters for Mandarin or Cantonese (or other dialects) and be in  potentially the wrong encoding. This can result in some wonky detection but it should not affect the overall output.
 - Output files are written in the same encoding as the input or can be forced to UTF-8
 - Should be cross platform friendly thanks to PathLib and Send2Trash
 - Handles UNC style ```\\myserver\myshare\mysub.srt``` paths thanks to PathLib
 - Handles SRT to TXT or WEBVTT to TXT
 - Handles multi line subtitles and subtitle lines with just numbers (does not confuse them with SRT line numbers)
-- Strips formatting tags, and rogue `{\an8}` tags you sometimes find in poorly converted subtitles 
+- Strips formatting tags, and rogue `{\an8}` tags you sometimes find in poorly converted subtitles
 - WEBVTT: Removes 'WEBVTT', headers, metadata, notes, styles and timestamps from output
 - SRT: Removes subtitle line #'s and Timestamps, will not work if first subtitle is not 1 or if duplicated line numbers are present (rare cases but possible), use [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to renumber lines for now if this happens. 
+- SSA/ASS: Removes all non-dialogue lines, detects the script version, and removes positional `{xxx}` tags from the text.
 ## Examples:
 WEBVTT Input:
 ```  
@@ -152,7 +150,7 @@ Output:
     Fue estupendo.
 ```
 ## Future plans:
-- Possibly handle more formats (.ssa Sub Station Alpha would be the other major one I could think of), for now you can use something like [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
+- Possibly handle more formats, for now you can use something like [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
 - GUI option for simple drag and drop usage.
 - Figure out a checking method for misnumbered or duplicate numbered SRT line numbers.
 ## License:

diff --git a/subtotxt.py b/subtotxt.py
@@ -1,7 +1,7 @@
 # cSpell:disable
 # SRT or WEBVTT to plain Text
 # Author: NebularNerd
-# Version: 2025-01-31
+# Version: 2025-02-03
 # https://github.com/NebularNerd/subtotxt
 import sys
 import os
@@ -10,6 +10,8 @@
 import re
 from pathlib import Path
 
+version = "2025-02-03"
+
 
 def missing_modules_installer(required_modules):
     import platform
@@ -99,11 +101,16 @@ def testsub(self):
                     return "vtt"
                 if line.strip("\n") == "1" and re.search("(.*:.*:.*-->.*:.*:.*)", next(ts)):
                     return "srt"
+                if any(s in line for s in ["!:", "Timer:", "Style:", "Comment:", "Dialogue:", "ScriptType:"]):
+                    return "ass"
 
     def junklist(self):
         # This list will grow
         # Escaping and r(raw) tag needed for special characters
-        return ["<.*?>", r"\{\\an8\}", r"^-\s", r"\[.*\]", r"\(.*\)", "^.*?:"]
+        j = ["<.*?>", r"\{.*?\}", r"\[.*\]", r"\(.*\)", r"^-\s"]
+        if args.nonames:
+            j.append("^.*?:")
+        return j
 
 
 def cls():  # Clear screen win/*nix friendly
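Review note on the `junklist()` change above: the patterns are now built conditionally, with the leading-name pattern only added under `--nonames`. The following is a minimal sketch of how such a list could be applied to one line. The `strip_junk` helper and the `nonames` parameter are hypothetical stand-ins; the script's own `process_line` is not shown in this diff and may work differently.

```python
import re

# Patterns copied from the new junklist() in the diff. The name pattern is
# kept separate because the diff only appends it when --nonames is set.
BASE_JUNK = ["<.*?>", r"\{.*?\}", r"\[.*\]", r"\(.*\)", r"^-\s"]
NAME_JUNK = "^.*?:"


def strip_junk(line, nonames=False):
    # Hypothetical helper: remove each pattern in turn, then trim whitespace.
    patterns = BASE_JUNK + ([NAME_JUNK] if nonames else [])
    for pattern in patterns:
        line = re.sub(pattern, "", line)
    return line.strip()


if __name__ == "__main__":
    print(strip_junk(r"{\an8}<i>Hello</i> [door slams]"))
    print(strip_junk("BOB: Hi there", nonames=True))
    print(strip_junk("It's 10:30 now", nonames=True))
```

One thing this makes visible: `^.*?:` removes everything up to the first colon on the line, so under `--nonames` a line such as `It's 10:30 now` loses its opening words as well. That may be acceptable for the "attempt" described in the README, but it is worth knowing when checking output.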
@@ -125,11 +132,23 @@ def yn(yn):  # Simple Y/N selector, use yn(text_for_choice)
 def arguments():
     parser = argparse.ArgumentParser(
         formatter_class=argparse.RawDescriptionHelpFormatter,
-        description="Quickly convert SRT or WEBVTT subtitles into plain text file.",
+        description="Quickly convert SRT, SSA or WEBVTT subtitles into a plain text file.",
         epilog="Visit https://github.com/NebularNerd/subtotxt for more information.",
     )
-    parser.add_argument(
-        "--file", "-f", type=str, required=True, help="Path to .srt or .vtt file, enclose in quotes if path has spaces"
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument(
+        "--file",
+        "-f",
+        type=str,
+        required=False,
+        help="Path to .srt/.vtt/.ass/.ssa file, enclose in quotes if path has spaces",
+    )
+    group.add_argument(
+        "--dir",
+        "-d",
+        type=str,
+        required=False,
+        help="Path to folder containing subtitle files, process all files in folder",
     )
     parser.add_argument(
         "--utf8",
@@ -179,6 +198,22 @@ def arguments():
         required=False,
         help="Write all sentences in one line, even if the original divides it into many lines or subtitles.",
     )
+    parser.add_argument(
+        "--nonames",
+        "-nn",
+        default=False,
+        action="store_true",
+        required=False,
+        help="Removes character names if present (.ssa/.ass), attempts this for other formats.",
+    )
+    parser.add_argument(
+        "--nosort",
+        "-ns",
+        default=False,
+        action="store_true",
+        required=False,
+        help="For SubStation Alpha (.ssa/.ass), do not sort by timecode.",
+    )
     return parser.parse_args()
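Review note on the argument changes above: `add_mutually_exclusive_group(required=True)` means exactly one of `--file` or `--dir` has to be supplied. A small standalone sketch of that behaviour follows; `build_parser` and `try_parse` are hypothetical helpers that mirror only the group from the diff, not the full parser.

```python
import argparse


def build_parser():
    # Hypothetical helper mirroring only the --file/--dir group from the diff.
    parser = argparse.ArgumentParser(prog="subtotxt-sketch")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--file", "-f", type=str)
    group.add_argument("--dir", "-d", type=str)
    return parser


def try_parse(argv):
    # argparse reports usage errors by printing to stderr and raising SystemExit.
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


if __name__ == "__main__":
    print(try_parse(["-f", "a.srt"]))
    print(try_parse(["-d", "subs"]))
    print(try_parse([]))  # neither option given: rejected
    print(try_parse(["-f", "a.srt", "-d", "subs"]))  # both given: rejected
```

Since the group already enforces the either/or rule, the `required=False` on each individual argument in the diff is redundant but harmless.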
 
 
@@ -241,6 +276,7 @@ def do_srt():
     # SubRip subtitle file .srt
     # https://en.wikipedia.org/wiki/SubRip
     # Format has a line number followed by a timecode on the next line, then text.
+    print("Processing file as SubRip subtitles [.srt]")
     with open(file.i, "r", encoding=enc.enc) as original:
         subnum = 1
         for line in original:  # Ignore SRT Subtitle # and Timecode lines
@@ -258,6 +294,7 @@ def do_vtt():
     # This format has a few differing 'standards', you have:
     # Metadata, notes, styles, timecodes with optional hours, and optional line numbers,
     # almost none of which are actually used it seems. But we need to handle them
+    print("Processing file as WebVTT (Web Video Text Tracks) [.vtt]")
     with open(file.i, "r", encoding=enc.enc) as original:
         subnum = 1
         head = 1  # Try and skip over everything until we reach the subtitles.
@@ -274,6 +311,44 @@ def do_vtt():
     write_to_file()
 
 
+def do_ass():
+    # SubStation Alpha subtitle file .ssa/.ass
+    # https://wiki.multimedia.cx/index.php?title=SubStation_Alpha
+    # http://www.tcax.org/docs/ass-specs.htm Browser may complain as not https site.
+    # This format has different versions; later ones include more metadata and sections.
+    # This should not be a big problem as the text is always on a `Dialogue:` line.
+    # Two key issues are: lines may not be in timecode order, and
+    # text may be for labelling things and not part of the script.
+    print("Processing file as SubStation Alpha subtitle [.ssa/.ass]")
+    with open(file.i, "r", encoding=enc.enc) as original:
+        # Try and get version
+        fv = ""
+        for line in original:
+            if "ScriptType:" in line:
+                fv = line.split(": ")[1].strip()
+        print(f"SSA Version: {fv}" if fv != "" else "No version found, assuming v1.0")
+        original.seek(0)
+        d = {}
+        for line in original:
+            # Example Dialog line v1.0:
+            # Dialogue: Marked=0,0:01:16.0,0:01:23.4,White Text,Usagi,0000,0000,0000,Pretty Soldier Sailor Moon
+            # Example Dialog line v3+:
+            # Dialogue: Marked=0,0:01:38.95,0:01:41.75,owari,Lupin,0000,0000,0000,,Yeah, love is wonderful.
+            if "Dialogue:" in line:
+                if fv == "":
+                    x = re.findall(r"Dialogue:.*?,(.*?\.\d*),.*?\.\d*,.*?,(.*?),.*?,.*?,.*?,(.*)", line)  # v1.0
+                else:
+                    x = re.findall(r"Dialogue:.*?,(.*?\.\d*),.*?\.\d*,.*?,(.*?),.*?,.*?,.*?,.*?,(.*)", line)  # v3.0+
+                stc = x[0][0]  # Start timecode
+                nom = x[0][1]  # Character speaking
+                txt = x[0][2]  # Text
+                text = txt if (args.nonames or nom == "") else f"{nom}: {txt}"
+                d.update({stc: {"dialog": text}})
+        for t in [v["dialog"] for k, v in sorted(d.items())] if not args.nosort else [v["dialog"] for v in d.values()]:
+            process_line(t.replace(r"\n", " ").replace(r"\N", " "))  # Fixes odd newline in .ass
+    write_to_file()
+
+
 def write_to_file():
     with open(file.o, "w", encoding=enc.out) as new:
         # We check for junk again because it can get split over two lines and we can't find it until now.
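Review note on `do_ass()` above: the dict `d` is keyed by start timecode, so two `Dialogue:` lines that start at the same time overwrite each other, and `sorted(d.items())` compares the timecodes as strings. The sketch below shows one alternative under stated assumptions, not the author's method. It assumes the field order seen in the example lines in the diff (Marked, Start, End, Style, Name, three margins, an optional Effect, then Text). The helper names `to_seconds`, `parse_dialogue` and `ordered_text` and the `has_effect` parameter are hypothetical.

```python
def to_seconds(timecode):
    # "H:MM:SS.cc" -> seconds as a float, so ordering is numeric.
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_dialogue(line, has_effect=True):
    # Assumed field order (from the example lines in the diff):
    # Marked, Start, End, Style, Name, MarginL, MarginR, MarginV, [Effect,] Text.
    # The text may contain commas, so the number of splits is capped.
    body = line.split(":", 1)[1].strip()
    fields = body.split(",", 9 if has_effect else 8)
    return to_seconds(fields[1]), fields[4], fields[-1]


def ordered_text(lines, has_effect=True, sort=True):
    rows = [parse_dialogue(ln, has_effect) for ln in lines if ln.startswith("Dialogue:")]
    if sort:
        # sorted() is stable: lines sharing a start time keep their file order.
        rows = sorted(rows, key=lambda row: row[0])
    return [f"{name}: {text}" if name else text for _, name, text in rows]


if __name__ == "__main__":
    sample = [
        "Dialogue: Marked=0,0:00:05.00,0:00:06.00,Default,A,0000,0000,0000,,second",
        "Dialogue: Marked=0,0:00:05.00,0:00:06.00,Default,B,0000,0000,0000,,third, with comma",
        "Dialogue: Marked=0,0:00:01.00,0:00:02.00,Default,,0000,0000,0000,,first",
    ]
    for text in ordered_text(sample):
        print(text)
```

Keeping a list of rows instead of a dict preserves every line, and passing `sort=False` gives the same file-order behaviour that `--nosort` describes.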
@@ -288,6 +363,8 @@ def do_work():
         do_srt()
     elif sub.format == "vtt":
         do_vtt()
+    elif sub.format == "ass":
+        do_ass()
     else:
         raise Exception("Unable to determine Subtitle format.")
 
@@ -296,16 +373,31 @@ def do_work():
     args = arguments()
     cls()
     try:
-        print(f"SUB to TXT v2025-01-31\n{'-' * 22}")
-        file = file_handler(Path(args.file))
-        enc = encoding(file.i)
-        if args.pause and not yn("Ready to start?"):
-            raise Exception("User exited at pause before start")
-        if args.copy:
-            copy()
-        else:
-            sub = subtitle()
-            do_work()
+        print(f"SUB to TXT v{version}\n{'-' * 22}")
+        if args.file:
+            file = file_handler(Path(args.file))
+            enc = encoding(file.i)
+            if args.pause and not yn("Ready to start?"):
+                raise Exception("User exited at pause before start")
+            if args.copy:
+                copy()
+            else:
+                sub = subtitle()
+                do_work()
+        if args.dir:
+            files = list(filter(lambda p: p.suffix in {".srt", ".vtt", ".ssa", ".ass"}, Path(args.dir).glob("*")))
+            how_many = len(files)
+            c = 0
+            print(f"Multi file mode. Found {how_many} files.")
+            print("-" * 22)
+            for file in files:
+                file = file_handler(Path(file))
+                enc = encoding(file.i)
+                sub = subtitle()
+                do_work()
+                print("-" * 22)
+                c += 1
+            print(f"Processed {c}/{how_many} files.")
         print("\nFinished!\n")
     except Exception as error:
         print(f"Script execution stopped because:\n{error}")
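Review note on the multi-file branch above: the suffix check `p.suffix in {...}` is case-sensitive, so a file named `MOVIE.SRT` would be skipped, and an exception from any single file reaches the outer `except` and ends the whole batch. The sketch below shows one possible way to handle both points. `find_subtitles`, `process_all` and the `convert` callable are hypothetical names; `convert` stands in for the per-file work done by `file_handler`, `encoding`, `subtitle` and `do_work` in the script.

```python
from pathlib import Path

SUPPORTED = {".srt", ".vtt", ".ssa", ".ass"}


def find_subtitles(folder):
    # Compare suffixes in lower case so "MOVIE.SRT" is picked up as well;
    # sorted() gives a predictable processing order.
    candidates = Path(folder).iterdir()
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in SUPPORTED)


def process_all(folder, convert):
    # `convert` is a hypothetical callable that does the per-file conversion.
    done, failed = [], []
    for path in find_subtitles(folder):
        try:
            convert(path)
            done.append(path.name)
        except Exception as error:  # keep going and report the failure at the end
            failed.append((path.name, str(error)))
    return done, failed
```

With this shape, the closing `Processed {c}/{how_many} files.` message could report the failed names too, instead of the run stopping at the first bad file.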