Skip normalization options + add variations for BN-NL part of N-Best corpus #30

Merged: 58 commits, Apr 5, 2024
Changes from 54 commits
Commits
1d96aae
Merge pull request #10 from opensource-spraakherkenning-nl/main
greenw0lf Oct 5, 2023
3f7979e
Remove ' and - from the punctuations to be removed
greenw0lf Nov 6, 2023
0016129
Merge pull request #11 from opensource-spraakherkenning-nl/remove-das…
greenw0lf Nov 6, 2023
221eeeb
Add some variations + remove dash (-) again
greenw0lf Nov 6, 2023
86ec141
Merge pull request #12 from opensource-spraakherkenning-nl/add-variat…
greenw0lf Nov 6, 2023
3f6f38a
1 more variation + remove dash (-) from punctuation to be removed
greenw0lf Nov 13, 2023
655cf48
Merge remote-tracking branch 'origin/development' into add-variations
greenw0lf Nov 13, 2023
fc17e2c
Merge pull request #13 from opensource-spraakherkenning-nl/add-variat…
greenw0lf Nov 13, 2023
62f618e
Fix variation bug
greenw0lf Nov 13, 2023
e3b1eef
Merge pull request #14 from opensource-spraakherkenning-nl/bugfix
greenw0lf Nov 13, 2023
2c614e1
Add capitalization to BNR variation
greenw0lf Nov 13, 2023
f883af2
Merge pull request #15 from opensource-spraakherkenning-nl/variationfix
greenw0lf Nov 13, 2023
67467bd
Add support for skipping normalization in pipeline
greenw0lf Nov 17, 2023
310f587
Merge pull request #16 from opensource-spraakherkenning-nl/skip-norm
greenw0lf Nov 17, 2023
2250756
Add sclite -D flag for optional words
greenw0lf Nov 17, 2023
8489409
Merge pull request #17 from opensource-spraakherkenning-nl/sclite-add…
greenw0lf Nov 17, 2023
b08506c
Testing skip normalization in interface
greenw0lf Nov 17, 2023
751a43e
Merge pull request #18 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 17, 2023
a12f09b
More interface testing
greenw0lf Nov 17, 2023
030df21
Merge remote-tracking branch 'origin/development' into skip-norm-interf
greenw0lf Nov 17, 2023
ce990df
Merge pull request #19 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 17, 2023
ea69c29
Even more interface testing
greenw0lf Nov 17, 2023
c67210d
Merge remote-tracking branch 'origin/development' into skip-norm-interf
greenw0lf Nov 17, 2023
4b03a5d
Merge pull request #20 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 17, 2023
f1ad820
Hopefully last UI changes
greenw0lf Nov 18, 2023
52392ca
Merge pull request #21 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 18, 2023
64073b8
More UI testing
greenw0lf Nov 19, 2023
b0df705
Merge pull request #22 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 19, 2023
f6a3aab
Checkbox for skipping now visible
greenw0lf Nov 19, 2023
e1753b3
Merge remote-tracking branch 'origin/development' into development
greenw0lf Nov 19, 2023
79e2e75
Merge pull request #23 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 19, 2023
85da6e5
Add a break between UI elements
greenw0lf Nov 20, 2023
6af12c1
Merge pull request #24 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 20, 2023
30211cd
Make it look better?
greenw0lf Nov 20, 2023
9c74854
Merge remote-tracking branch 'origin/development' into skip-norm-interf
greenw0lf Nov 20, 2023
86d4596
Merge pull request #25 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 20, 2023
d4b5a23
Final touches for the interface
greenw0lf Nov 20, 2023
92df5f3
Merge remote-tracking branch 'origin/development' into skip-norm-interf
greenw0lf Nov 20, 2023
0db9095
Merge pull request #26 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 20, 2023
f0e687e
Fix issue with getting values from form submit
greenw0lf Nov 20, 2023
b213412
Merge pull request #27 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 20, 2023
f6b5f0a
Hopefully works this time (changed value to name)
greenw0lf Nov 20, 2023
665acda
Merge pull request #28 from opensource-spraakherkenning-nl/skip-norm-…
greenw0lf Nov 20, 2023
526ffd1
Add variations from top 20 confusion pairs
greenw0lf Nov 20, 2023
54196eb
Merge pull request #29 from opensource-spraakherkenning-nl/add-bn-nl-…
greenw0lf Nov 20, 2023
ff18c71
One final variation
greenw0lf Nov 20, 2023
cbf617e
Merge pull request #31 from opensource-spraakherkenning-nl/add-bn-nl-…
greenw0lf Nov 20, 2023
2a36c17
Test removing -m hyp from sclite command in pipeline
greenw0lf Dec 5, 2023
a323856
Merge pull request #32 from opensource-spraakherkenning-nl/experimental
greenw0lf Dec 5, 2023
04f99c3
add a flag that gives a more detailed breakdown
greenw0lf Dec 5, 2023
dea75dc
Merge pull request #33 from opensource-spraakherkenning-nl/experimental
greenw0lf Dec 5, 2023
528820a
Add another variation for Moszkowicz
greenw0lf Dec 7, 2023
a4ee732
Merge pull request #34 from opensource-spraakherkenning-nl/experimental
greenw0lf Dec 7, 2023
7b675ec
Update part of the README
greenw0lf Jan 3, 2024
3074dee
Add small changes before adding the sc_args functionality
greenw0lf Feb 22, 2024
a9d39bb
Remove sc_args (to be added upon request)
greenw0lf Feb 22, 2024
8c05370
Add test files
greenw0lf Mar 29, 2024
0d23a05
Rename folder with example files and add small comment in README
greenw0lf Apr 5, 2024
16 changes: 15 additions & 1 deletion ASR_NL_benchmark/__main__.py
@@ -21,6 +21,15 @@
                     metavar='value',
                     default='',
                     help='help: True if you want to use the GUI')
+parser.add_argument('-skip_ref_normalization',
+                    action = 'store_true',
+                    help = 'Skip the normalization step for the reference file')
+parser.add_argument('-skip_hyp_normalization',
+                    action = 'store_true',
+                    help = 'Skip the normalization step for the hypothesis file')
+parser.add_argument('-skip-normalization',
+                    action = 'store_true',
+                    help = 'Skip the normalization step for both hypothesis and reference files')
 
 args = parser.parse_args()
 
@@ -29,7 +38,12 @@
     interface.main()
 else:
     print('Running benchmarking')
-    benchmarking = pipeline.Pipeline(args.hypfile[0], args.hypfile[1], args.reffile[0], args.reffile[1], kind=args.kind)
+    skip_ref_norm = args.skip_ref_normalization
+    skip_hyp_norm = args.skip_hyp_normalization
+    if args.skip_normalization:
+        skip_ref_norm = True
+        skip_hyp_norm = True
+    benchmarking = pipeline.Pipeline(args.hypfile[0], args.hypfile[1], args.reffile[0], args.reffile[1], kind=args.kind, skip_ref_norm=skip_ref_norm, skip_hyp_norm=skip_hyp_norm)
     benchmarking.main()
     pipeline.process_results(kind=args.kind)
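
The intended resolution of the three flags can be illustrated in isolation. The following is a minimal standalone sketch, assuming the combined flag is meant to imply both individual flags; `build_parser` and `resolve_skip_flags` are hypothetical helper names that do not exist in the repository.

```python
import argparse


def build_parser():
    """Build a parser with only the three skip flags from this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-skip_ref_normalization', action='store_true')
    parser.add_argument('-skip_hyp_normalization', action='store_true')
    # argparse turns the internal '-' into '_' for the attribute name,
    # so this flag is read back as args.skip_normalization.
    parser.add_argument('-skip-normalization', action='store_true')
    return parser


def resolve_skip_flags(args):
    """Hypothetical helper: the combined flag switches on both individual flags."""
    skip_ref = args.skip_ref_normalization or args.skip_normalization
    skip_hyp = args.skip_hyp_normalization or args.skip_normalization
    return skip_ref, skip_hyp
```

Using `or` keeps the individual flags working on their own while letting the combined flag override both.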

4 changes: 3 additions & 1 deletion ASR_NL_benchmark/interface.py
@@ -20,8 +20,10 @@ def upload_page():
         hyp = os.path.join(os.path.sep,'input',request.form.get('hyp'))
         ref = os.path.join(os.path.sep,'input',request.form.get('ref'))
         kind = request.form.get('kind')
+        skip_ref_norm = request.form.get('skip-ref-norm')
+        skip_hyp_norm = request.form.get('skip-hyp-norm')
         global benchmarking
-        benchmarking = pipeline.Pipeline(hyp, 'ctm', ref, 'stm', kind)
+        benchmarking = pipeline.Pipeline(hyp, 'ctm', ref, 'stm', kind, skip_ref_norm, skip_hyp_norm)
         Thread(target=benchmarking.main).start()
         return redirect(f'/progress?ref={ref}&hyp={hyp}')
     return render_template('select_files.html')
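
A note on the form values: an HTML checkbox without a `value` attribute is submitted as the string `on` when ticked and is left out of the form data when unticked, so `request.form.get(...)` yields either `'on'` or `None`. Both work with `if not skip_ref_norm`, but an explicit boolean makes the log output easier to read. A minimal sketch, with a plain dict standing in for `request.form` and `checkbox_to_bool` as a hypothetical helper name:

```python
def checkbox_to_bool(form, name):
    """Return True if the checkbox `name` was ticked in the submitted form.

    `form` is anything with a dict-style .get(), such as Flask's request.form.
    """
    return form.get(name) is not None


# A plain dict stands in for request.form here.
ticked = {'hyp': 'hyp.ctm', 'skip-hyp-norm': 'on'}
unticked = {'hyp': 'hyp.ctm'}
skip_when_ticked = checkbox_to_bool(ticked, 'skip-hyp-norm')
skip_when_unticked = checkbox_to_bool(unticked, 'skip-hyp-norm')
```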
3 changes: 2 additions & 1 deletion ASR_NL_benchmark/normalize.py
@@ -44,9 +44,10 @@ def replace_numbers_and_symbols(text):
     >>> replace_numbers_and_symbols('12,3%')
     'twaalf komma drie procent'
     """
+    removed_punct = string.punctuation.replace("'", '').replace('-', '')
     text_without_symbols = replace_symbols(text)
     clean_text = replace_numbers(text_without_symbols)
-    clean_text = clean_text.translate(str.maketrans('', '', string.punctuation))
+    clean_text = clean_text.translate(str.maketrans('', '', removed_punct))
     return clean_text
 
 def replace_numbers(text):
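
The effect of the new punctuation set can be checked on its own. This sketch reuses the `removed_punct` expression from the diff; `strip_punctuation` is a hypothetical wrapper that leaves out the number and symbol replacement steps.

```python
import string

# Every ASCII punctuation character except the apostrophe and the hyphen.
removed_punct = string.punctuation.replace("'", '').replace('-', '')


def strip_punctuation(text):
    """Delete punctuation but keep ' and - so that words such as
    "da's" and "BNR-nieuwsradio" stay intact."""
    return text.translate(str.maketrans('', '', removed_punct))


cleaned = strip_punctuation("da's BNR-nieuwsradio, toch?")
```

Keeping both characters matters for the rules added to `variations.glm` in this PR, which match on forms like `da's` and `BNR-nieuwsradio`.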
18 changes: 12 additions & 6 deletions ASR_NL_benchmark/pipeline.py
@@ -26,7 +26,7 @@ def set_logging(logpath):
     return logging
 
 
-def run_pipeline(hypfile, reffile):
+def run_pipeline(hypfile, reffile, skip_ref_norm, skip_hyp_norm):
     """ Validates and Normalizes the hyp and ref file and runs them through sclite
     Args:
         hypfile: the hypothesis file
@@ -37,9 +37,11 @@ def run_pipeline(hypfile, reffile):
     reffile.validate(great_expectations_validation)
 
     # Normalize
-    reffile.clean_text(replace_numbers_and_symbols)
+    if not skip_ref_norm:
+        reffile.clean_text(replace_numbers_and_symbols)
     reffile.export(os.path.join(os.path.sep,'input',f'{reffile.name}_normalized.{reffile.extension}'))
-    hypfile.clean_text(replace_numbers_and_symbols)
+    if not skip_hyp_norm:
+        hypfile.clean_text(replace_numbers_and_symbols)
     hypfile.export(os.path.join(os.path.sep,'input',f'{hypfile.name}_normalized.{hypfile.extension}'))
 
     #Create results folder if not exists:
@@ -56,7 +58,7 @@ def run_pipeline(hypfile, reffile):
     run = os.system(
         f"csrfilt.sh -s -i stm {os.path.join('ASR_NL_benchmark','variations.glm')} < {reffile.normalized_path} > {reffile.variation_path}")
     run = os.system(
-        f"sclite -h {hypfile.variation_path} {hypfile.extension} -r {reffile.variation_path} {reffile.extension} -m hyp -O {os.path.join(os.path.sep,'input','results')} -o dtl spk")
+        f"sclite -D -h {hypfile.variation_path} {hypfile.extension} -r {reffile.variation_path} {reffile.extension} -m hyp -O {os.path.join(os.path.sep,'input','results')} -o prf dtl spk")
 
 def calculate_wer(df):
     """ Calculates the word error rate and adds the column 'product' to the dataframe
@@ -210,19 +212,23 @@ def process_input(hypfile_arg, reffile_arg):
 
 
 class Pipeline():
-    def __init__(self, hypfile_input_path, hypextension, reffile_input_path, refextension, kind):
+    def __init__(self, hypfile_input_path, hypextension, reffile_input_path, refextension, kind, skip_ref_norm, skip_hyp_norm):
         self.progress = 0
         self.failed = 0
         self.hypfile_input_path = os.path.join(os.path.sep,'input',hypfile_input_path)
         self.reffile_input_path = os.path.join(os.path.sep,'input',reffile_input_path)
         self.hypextension = hypextension
         self.refextension = refextension
         self.kind = kind
+        self.skip_ref_norm = skip_ref_norm
+        self.skip_hyp_norm = skip_hyp_norm
         self.logging = set_logging(logpath=os.path.join(os.path.sep,'input',f'{date.today()}_logging.log'))
         self.logging.info(f"hypfile path from terminal: {hypfile_input_path}")
         self.logging.info(f"reffile path from terminal: {reffile_input_path}")
         self.logging.info(f"Pipeline class' hypfile path: {self.hypfile_input_path}")
         self.logging.info(f"Pipeline class' reffile path: {self.reffile_input_path}")
+        self.logging.info(f"Skip reffile normalization: {self.skip_ref_norm}")
+        self.logging.info(f"Skip hypfile normalization: {self.skip_hyp_norm}")
 
     def main(self):
         hyp_list, ref_list = process_input(self.hypfile_input_path, self.reffile_input_path)
@@ -235,7 +241,7 @@ def main(self):
                 # Parse input
                 reffile = STM(reffile_path, self.refextension)
                 hypfile = CTM(hypfile_path, self.hypextension)
-                run_pipeline(hypfile, reffile)
+                run_pipeline(hypfile, reffile, self.skip_ref_norm, self.skip_hyp_norm)
                 done += 1
                 self.progress = done/total
             except:
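
The conditional normalization in `run_pipeline` can be sketched without the rest of the pipeline. This is an illustration only: `FakeTranscript` is a hypothetical stand-in for the repository's `STM` and `CTM` classes and models nothing but `clean_text`, and `normalize_pair` is a hypothetical name for the two guarded calls.

```python
class FakeTranscript:
    """Hypothetical stand-in for the STM/CTM objects; only clean_text is modeled."""

    def __init__(self, text):
        self.text = text

    def clean_text(self, normalizer):
        # Apply the normalizer function to the stored text.
        self.text = normalizer(self.text)


def normalize_pair(hypfile, reffile, normalizer,
                   skip_ref_norm=False, skip_hyp_norm=False):
    """Normalize each file unless its skip flag is set."""
    if not skip_ref_norm:
        reffile.clean_text(normalizer)
    if not skip_hyp_norm:
        hypfile.clean_text(normalizer)
    return hypfile, reffile


# str.lower stands in for replace_numbers_and_symbols.
hyp, ref = normalize_pair(FakeTranscript("Hallo"), FakeTranscript("Hallo"),
                          str.lower, skip_ref_norm=True)
```

Note that in the diff only `clean_text` is guarded: the `export` calls that write the `_normalized` files still run when a skip flag is set, so the later `csrfilt.sh` and `sclite` steps find their input files either way.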
13 changes: 11 additions & 2 deletions ASR_NL_benchmark/templates/select_files.html
@@ -21,17 +21,26 @@
 
 
 <div class="container pt-3 m-3" width="80%">
-    <h1> Select Hypothese and Reference files or folders </h1>
+    <h1> Select Hypothesis and Reference files or folders </h1>
 </div>
 <div class="container pt-3 m-3" width="80%">
     <div class="form-group">
         <form method="POST">
             <label>Name of speech recognizer</label>
             <input type="text" class="form-control" id="kind" name="kind" placeholder="Name of speech recognizer">
+            <p>_______________________________</p>
             <label>Path to hypothesis file or folder</label>
             <input type="text" class="form-control" id="hyp" name="hyp" placeholder="Hyp File or folder">
+            <input type="checkbox" id="skip-hyp-norm" name="skip-hyp-norm">
+            <label for="skip-hyp-norm">Skip the normalization step for the hypothesis file(s)</label>
+            <br>
+            <p>_______________________________</p>
             <label>Path to reference file or folder</label>
-            <input type="text" class="form-control" id="ref" name="ref" placeholder="Ref File or folder"><button type="submit" class="btn btn-primary" >Submit</button>
+            <input type="text" class="form-control" id="ref" name="ref" placeholder="Ref File or folder">
+            <input type="checkbox" id="skip-ref-norm" name="skip-ref-norm">
+            <label for="skip-ref-norm">Skip the normalization step for the reference file(s)</label>
+            <br>
+            <button type="submit" class="btn btn-primary" >Submit</button>
         </form>
     </div>
 </div>
12 changes: 12 additions & 0 deletions ASR_NL_benchmark/variations.glm
@@ -87,3 +87,15 @@ tewerk => te werk / [ ] __ [ ]
 [marktonderzoekbureau] => [{ marktonderzoekbureau / marktonderzoeksbureau }] / [ ] __ [ ]
 [Noordwestkust] => [{ Noordwestkust / Noord-Westkust }] / [ ] __ [ ]
 [carnavalvierders] => [{ carnavalvierders / carnavalsvierders }] / [ ] __ [ ]
+
+;; Whisper evaluation on N-Best
+;; BN-NL
+ie => hij / [ ] __ [ ]
+da's => dat is / [ ] __ [ ]
+[BNR-nieuwsradio] => [{ BNR-nieuwsradio / BNR nieuwsradio }] / [ ] __ [ ]
+[Moszkowicz] => [{ Moszkowicz / Moskovic / Moskowitz }] / [ ] __ [ ]
+[Kooi] => [{ Kooi / Kooij }] / [ ] __ [ ]
+[Araújo] => [{ Araújo / Araujo }] / [ ] __ [ ]
+[Bagdad] => [{ Bagdad / Baghdad }] / [ ] __ [ ]
+[Holleeder] => [{ Holleeder / Holleder }] / [ ] __ [ ]
+[Imac] => [{ Imac / Imaç }] / [ ] __ [ ]
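
As a rough illustration of what the two plain rewrite rules (`ie => hij` and `da's => dat is`) are meant to do to a transcript, the sketch below applies them as whole-token replacements in Python. It is not how `csrfilt.sh` is implemented, and it assumes that the `[ ] __ [ ]` context restricts a rule to whole words; `REWRITES` and `apply_rewrites` are hypothetical names, and the bracketed alternation rules are not modeled.

```python
# The two plain rewrite rules added for BN-NL, as a token-level mapping.
REWRITES = {
    "ie": "hij",
    "da's": "dat is",
}


def apply_rewrites(text):
    """Replace whole whitespace-separated tokens; substrings are left untouched."""
    return " ".join(REWRITES.get(token, token) for token in text.split())


rewritten = apply_rewrites("da's wat ie zei")
```

Matching on whole tokens is what keeps a word such as `die` from being changed by the `ie` rule.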