Improve get_n_words performance #187

diegozea · 2025-11-24T09:02:55Z

Reworked get_n_words in src/Utils/GeneralUtils.jl into a single-pass splitter that skips leading whitespace, tracks the last character boundary to avoid prevind overhead, and stops splitting after n-1 tokens so the nth entry still gathers the remainder. The function now returns only the words found (no trailing #undef/empty fields) and returns String[] for delimiter-only lines. Added a brief inline comment for the index bookkeeping.
Added coverage in test/Utils/GeneralUtils.jl for requests with too few words, trailing delimiters, and delimiter-only input to lock in the new behavior.
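
The single-pass approach described above can be sketched roughly as follows. This is not the code from src/Utils/GeneralUtils.jl: the name `first_n_words_sketch` and the helper `is_delim` are hypothetical, and the sketch assumes that space and tab are the only delimiters and that the nth entry keeps the remainder verbatim, including any trailing delimiters.

```julia
# Hypothetical sketch of a single-pass "first n words" splitter.
# Assumption: only space and tab act as delimiters.
is_delim(c::Char) = c == ' ' || c == '\t'

function first_n_words_sketch(line::AbstractString, n::Integer)
    words = String[]
    n < 1 && return words
    i = firstindex(line)
    stop = lastindex(line)
    while i <= stop
        # Skip any run of delimiters before the next word.
        while i <= stop && is_delim(line[i])
            i = nextind(line, i)
        end
        i > stop && break  # only delimiters were left
        if length(words) == n - 1
            # The nth entry gathers the remainder, delimiters included.
            push!(words, String(SubString(line, i, stop)))
            break
        end
        # Scan to the end of the word, remembering the index of the last
        # character so that no prevind call is needed when slicing.
        word_start = i
        word_end = i
        while i <= stop && !is_delim(line[i])
            word_end = i
            i = nextind(line, i)
        end
        push!(words, String(SubString(line, word_start, word_end)))
    end
    return words
end

first_n_words_sketch("#=GR O31698/18-71 SS    CCCH", 3)  # ["#=GR", "O31698/18-71", "SS    CCCH"]
first_n_words_sketch("a b \t", 3)                         # ["a", "b"]
first_n_words_sketch(" \t ", 3)                           # String[]
```

Because the vector only grows with `push!`, a line with fewer than n words yields a shorter vector, with no `#undef` slots left over.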

Performance (BenchmarkTools):

Baseline (saved before changes) vs current measured with @benchmark get_n_words(line,3) and judge(new, base):
- ASCII line: 79.48 ns (5 allocs, 208 B), 5.8% improvement
- UTF-8 line: 65.42 ns (5 allocs, 224 B), 18.0% improvement
- Allocation count is at the minimum (vector + returned strings), and whitespace-only lines now avoid allocating extra fields.
Profiled 200k calls with Profile.@profile; remaining hotspots are the unavoidable string slicing/allocations, confirming that the loop/prevind overhead was removed.
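
A comparison like the one above could be set up along these lines. This is a minimal sketch that assumes BenchmarkTools is installed. `old_impl` and `new_impl` are hypothetical stand-ins built on `Base.split`, not the MIToS functions, and the example line is invented. `judge` is applied to median estimates here, since it compares `TrialEstimate` values.

```julia
using BenchmarkTools

# Hypothetical stand-ins for the baseline and the reworked implementation.
old_impl(line) = split(line, ' '; limit=3)
new_impl(line) = split(line; limit=3)

line = "#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH"

# Interpolate the argument with $ so global-variable access is not timed.
baseline = @benchmark old_impl($line) seconds=1
current = @benchmark new_impl($line) seconds=1

# judge takes the new estimate first and the baseline second.
verdict = judge(median(current), median(baseline))
println(verdict)
println("time ratio (current / baseline): ", time(BenchmarkTools.ratio(verdict)))
```

A ratio below 1 means the current version is faster. `time(verdict)` reports `:improvement`, `:regression`, or `:invariant` according to the tolerance that `judge` applies.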


codecov · 2025-11-24T09:08:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.97%. Comparing base (8548471) to head (a12a7e0).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
+ Coverage   96.95%   96.97%   +0.02%     
==========================================
  Files          64       64              
  Lines        4860     4861       +1     
==========================================
+ Hits         4712     4714       +2     
+ Misses        148      147       -1


github-actions · 2025-11-24T09:11:16Z

Benchmark Results (Julia v1)

Time benchmarks

Benchmark	master	`a12a7e0`...	master / `a12a7e0`...
Information/CorrectedMutualInformation/buslje09/msa	0.879 ± 0.0086 s	0.878 ± 0.0094 s	1 ± 0.015
Information/CorrectedMutualInformation/buslje09/msa_large	0.0362 ± 0.00013 s	0.0362 ± 0.0001 s	1 ± 0.0045
Information/CorrectedMutualInformation/buslje09/msa_wide	0.784 ± 0.008 s	0.784 ± 0.0042 s	1 ± 0.012
Information/MIp/PF09645	9.57 ± 0.035 ms	9.45 ± 0.032 ms	1.01 ± 0.0051
Information/frequencies!/1	0.28 ± 0.021 μs	0.281 ± 0.029 μs	0.996 ± 0.13
Information/frequencies!/2	1.44 ± 0.02 μs	1.44 ± 0.02 μs	1 ± 0.02
Information/highlevel/BLMI	0.0645 ± 0.00026 s	0.0643 ± 0.00017 s	1 ± 0.0048
Information/highlevel/buslje09	11.3 ± 0.05 ms	11.3 ± 0.053 ms	1 ± 0.0065
Information/shannon_entropy/PF09645	19.7 ± 0.47 μs	19.7 ± 0.4 μs	1 ± 0.031
MSA/Annotations/filtercolumns/boolean mask	9.83 ± 0.18 μs	9.92 ± 0.19 μs	0.991 ± 0.027
MSA/Annotations/filtercolumns/index array	3.51 ± 0.08 μs	3.52 ± 0.09 μs	0.997 ± 0.034
MSA/Base.vcat/annotated	4.42 ± 0.28 μs	4.35 ± 0.25 μs	1.02 ± 0.087
MSA/Base.vcat/unannotated	1.51 ± 0.12 μs	1.48 ± 0.12 μs	1.02 ± 0.12
MSA/Residue conversions/char2res	0.351 ± 0.76 ms	0.35 ± 0.64 ms	1 ± 2.8
MSA/Residue conversions/int2res	0.191 ± 0.08 ms	0.191 ± 0.085 ms	1 ± 0.61
MSA/Residue conversions/res2char	0.26 ± 0.013 ms	0.26 ± 0.013 ms	0.999 ± 0.071
MSA/Residue conversions/res2int	0.183 ± 0.12 ms	0.18 ± 0.13 ms	1.01 ± 0.97
MSA/hobohmI/pid62	0.511 ± 0.03 μs	0.511 ± 0.022 μs	1 ± 0.073
MSA/identity/matrix_Float64	17.6 ± 0.51 μs	17.5 ± 0.44 μs	1.01 ± 0.039
MSA/identity/mean	0.0878 ± 0.022 ms	0.0884 ± 0.022 ms	0.993 ± 0.35
MSA/read/FASTA.gz	0.0468 ± 0.0094 ms	0.0464 ± 0.0088 ms	1.01 ± 0.28
MSA/read/FASTA.gz_annotated	0.0497 ± 0.0016 ms	0.0507 ± 0.0019 ms	0.981 ± 0.048
MSA/read/FASTA_deletefullgaps	6.95 ± 0.37 ms	6.99 ± 1.1 ms	0.995 ± 0.17
MSA/read/FASTA_deletefullgaps_mapping	0.0916 ± 0.0054 s	0.0928 ± 0.0049 s	0.987 ± 0.078
MSA/read/Stockholm	0.0339 ± 0.0029 ms	0.0339 ± 0.0023 ms	0.999 ± 0.11
MSA/read/Stockholm_annotated	0.0456 ± 0.0086 ms	0.0458 ± 0.0084 ms	0.995 ± 0.26
MSA/read/Stockholm_mapping	0.185 ± 0.016 ms	0.185 ± 0.016 ms	1 ± 0.12
MSA/read/Stockholm_mapping_coords	0.0995 ± 0.028 ms	0.1 ± 0.028 ms	0.994 ± 0.39
MSA/write/FASTA	0.231 ± 0.038 ms	0.22 ± 0.041 ms	1.05 ± 0.26
PDB/_generate_interaction_keys/defaults	29.6 ± 17 μs	29.3 ± 17 μs	1.01 ± 0.83
PDB/_get_matched_Cαs/hemoglobin	0.0339 ± 0.0082 ms	0.0345 ± 0.0088 ms	0.983 ± 0.35
PDB/_pdbresidues_to_mmcifdict/2vqc	0.601 ± 0.023 ms	0.58 ± 0.029 ms	1.04 ± 0.066
PDB/contact/1CBN_20_30_CB	0.2 ± 0.001 μs	0.201 ± 0.01 μs	0.995 ± 0.05
PDB/contact/1CBN_20_30_heavy	0.251 ± 0.001 μs	0.251 ± 0.011 μs	1 ± 0.044
PDB/count_alanine/1CBN	0.331 ± 0.011 μs	0.331 ± 0.01 μs	1 ± 0.045
PDB/distance/1CBN_20_30	0.14 ± 0.01 μs	0.14 ± 0.001 μs	1 ± 0.072
PDB/read/MMCIFFile	2.93 ± 0.041 ms	2.94 ± 0.055 ms	0.997 ± 0.023
PDB/squared_distance/1CBN_20_30_CB	0.21 ± 0.001 μs	0.201 ± 0.01 μs	1.04 ± 0.052
PDB/squared_distance/1CBN_20_30_heavy	0.251 ± 0.01 μs	0.25 ± 0.01 μs	1 ± 0.057
Pfam/accession mapping/acc2seqnames	0.19 ± 0.0099 ms	0.187 ± 0.0094 ms	1.02 ± 0.074
SIFTS/ResidueDetails/_get_details	2.94 ± 0.86 μs	1.86 ± 0.57 μs	1.58 ± 0.67
SIFTS/ResidueDetails/_is_missing	3.08 ± 0.84 μs	1.99 ± 0.54 μs	1.54 ± 0.59
SIFTS/SIFTSResidue/18gs	0.1 ± 0.01 μs	0.1 ± 0.01 μs	1 ± 0.14
SIFTS/siftsmapping/2vqc	2.25 ± 0.055 ms	2.22 ± 0.048 ms	1.01 ± 0.033
Utils/get_n_words/ascii	0.15 ± 0.01 μs	0.14 ± 0.01 μs	1.07 ± 0.1
Utils/get_n_words/utf8	0.14 ± 0.01 μs	0.12 ± 0.001 μs	1.17 ± 0.084
Utils/hascoordinates/invalid	0.08 ± 0.01 μs	0.08 ± 0.001 μs	1 ± 0.13
Utils/hascoordinates/valid	0.13 ± 0.001 μs	0.13 ± 0.001 μs	1 ± 0.011
Utils/list2matrix/upper	0.178 ± 0.075 ms	0.174 ± 0.043 ms	1.02 ± 0.5
Utils/list2matrix/upper_diagonal	0.263 ± 0.096 ms	0.267 ± 0.09 ms	0.984 ± 0.49
Utils/matrix2list/upper	0.0712 ± 0.011 ms	0.0737 ± 0.01 ms	0.966 ± 0.2
Utils/matrix2list/upper_diagonal	0.075 ± 0.014 ms	0.0756 ± 0.011 ms	0.991 ± 0.23
time_to_load	0.785 ± 0.0039 s	0.792 ± 0.0055 s	0.992 ± 0.0084
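
The last column is master divided by the PR commit, so values above 1 mean the PR is faster. The report does not say how the ± on the ratio is obtained. Recomputing a few rows suggests it is consistent with first-order error propagation for a quotient, which the sketch below checks for the Utils/get_n_words/utf8 row. The function name `ratio_with_error` is hypothetical, and the inputs are the rounded values from the table, so agreement is only approximate.

```julia
# First-order (quadrature) error propagation for a ratio a / b:
# σ_r = r * sqrt((σ_a / a)^2 + (σ_b / b)^2)
function ratio_with_error(a, σa, b, σb)
    r = a / b
    σr = r * sqrt((σa / a)^2 + (σb / b)^2)
    return r, σr
end

# Values copied from the Utils/get_n_words/utf8 row (μs): master, then the PR commit.
r, σr = ratio_with_error(0.14, 0.01, 0.12, 0.001)
println(round(r; sigdigits=3), " ± ", round(σr; sigdigits=2))
```

Read this way, the 1.17 ± 0.084 entry for the UTF-8 case sits about two standard deviations above 1, while the ASCII entry (1.07 ± 0.1) is within one standard deviation of no change.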

Memory benchmarks

Benchmark	master	`a12a7e0`...	master / `a12a7e0`...
Information/CorrectedMutualInformation/buslje09/msa	0.766 M allocs: 0.032 GB	0.766 M allocs: 0.032 GB	1
Information/CorrectedMutualInformation/buslje09/msa_large	0.0901 M allocs: 5.03 MB	0.0901 M allocs: 5.03 MB	1
Information/CorrectedMutualInformation/buslje09/msa_wide	0.742 M allocs: 30.3 MB	0.742 M allocs: 30.3 MB	1
Information/MIp/PF09645	20.3 k allocs: 0.819 MB	20.3 k allocs: 0.819 MB	1
Information/frequencies!/1	0 allocs: 0 B	0 allocs: 0 B
Information/frequencies!/2	0 allocs: 0 B	0 allocs: 0 B
Information/highlevel/BLMI	19.9 k allocs: 1.19 MB	19.9 k allocs: 1.19 MB	1
Information/highlevel/buslje09	0.0377 M allocs: 2.3 MB	0.0377 M allocs: 2.3 MB	1
Information/shannon_entropy/PF09645	0.047 k allocs: 12.2 kB	0.047 k allocs: 12.2 kB	1
MSA/Annotations/filtercolumns/boolean mask	18 allocs: 5.22 kB	18 allocs: 5.22 kB	1
MSA/Annotations/filtercolumns/index array	16 allocs: 1.62 kB	16 allocs: 1.62 kB	1
MSA/Base.vcat/annotated	0.143 k allocs: 6.58 kB	0.143 k allocs: 6.58 kB	1
MSA/Base.vcat/unannotated	0.064 k allocs: 2.7 kB	0.064 k allocs: 2.7 kB	1
MSA/Residue conversions/char2res	3 allocs: 4.1 MB	3 allocs: 4.1 MB	1
MSA/Residue conversions/int2res	3 allocs: 4.1 MB	3 allocs: 4.1 MB	1
MSA/Residue conversions/res2char	3 allocs: 2.05 MB	3 allocs: 2.05 MB	1
MSA/Residue conversions/res2int	3 allocs: 4.1 MB	3 allocs: 4.1 MB	1
MSA/hobohmI/pid62	31 allocs: 1.77 kB	31 allocs: 1.77 kB	1
MSA/identity/matrix_Float64	0.249 k allocs: 11.8 kB	0.249 k allocs: 11.8 kB	1
MSA/identity/mean	1.23 k allocs: 0.0517 MB	1.23 k allocs: 0.0517 MB	1
MSA/read/FASTA.gz	0.443 k allocs: 0.0752 MB	0.443 k allocs: 0.0752 MB	1
MSA/read/FASTA.gz_annotated	0.536 k allocs: 0.0794 MB	0.533 k allocs: 0.0793 MB	1
MSA/read/FASTA_deletefullgaps	13.6 k allocs: 17.4 MB	13.6 k allocs: 17.4 MB	1
MSA/read/FASTA_deletefullgaps_mapping	1.64 M allocs: 0.0795 GB	1.64 M allocs: 0.0795 GB	1
MSA/read/Stockholm	0.402 k allocs: 0.033 MB	0.402 k allocs: 0.033 MB	1
MSA/read/Stockholm_annotated	0.559 k allocs: 0.0413 MB	0.559 k allocs: 0.0413 MB	1
MSA/read/Stockholm_mapping	2.08 k allocs: 0.104 MB	2.08 k allocs: 0.104 MB	1
MSA/read/Stockholm_mapping_coords	1.64 k allocs: 0.0812 MB	1.64 k allocs: 0.0812 MB	1
MSA/write/FASTA	0.303 k allocs: 14.1 kB	0.303 k allocs: 14.1 kB	1
PDB/_generate_interaction_keys/defaults	0.497 k allocs: 0.0581 MB	0.497 k allocs: 0.0581 MB	1
PDB/_get_matched_Cαs/hemoglobin	0.584 k allocs: 0.0438 MB	0.584 k allocs: 0.0438 MB	1
PDB/_pdbresidues_to_mmcifdict/2vqc	8.56 k allocs: 1.12 MB	8.56 k allocs: 1.12 MB	1
PDB/contact/1CBN_20_30_CB	4 allocs: 0.281 kB	4 allocs: 0.281 kB	1
PDB/contact/1CBN_20_30_heavy	4 allocs: 0.281 kB	4 allocs: 0.281 kB	1
PDB/count_alanine/1CBN	0 allocs: 0 B	0 allocs: 0 B
PDB/distance/1CBN_20_30	0 allocs: 0 B	0 allocs: 0 B
PDB/read/MMCIFFile	0.039 M allocs: 2.9 MB	0.039 M allocs: 2.9 MB	1
PDB/squared_distance/1CBN_20_30_CB	4 allocs: 0.281 kB	4 allocs: 0.281 kB	1
PDB/squared_distance/1CBN_20_30_heavy	4 allocs: 0.281 kB	4 allocs: 0.281 kB	1
Pfam/accession mapping/acc2seqnames	4.32 k allocs: 0.319 MB	4.32 k allocs: 0.319 MB	1
SIFTS/ResidueDetails/_get_details	25 allocs: 1.45 kB	25 allocs: 1.45 kB	1
SIFTS/ResidueDetails/_is_missing	25 allocs: 1.45 kB	25 allocs: 1.45 kB	1
SIFTS/SIFTSResidue/18gs	4 allocs: 0.125 kB	4 allocs: 0.125 kB	1
SIFTS/siftsmapping/2vqc	5.94 k allocs: 0.88 MB	5.94 k allocs: 0.88 MB	1
Utils/get_n_words/ascii	5 allocs: 0.203 kB	5 allocs: 0.203 kB	1
Utils/get_n_words/utf8	5 allocs: 0.219 kB	5 allocs: 0.219 kB	1
Utils/hascoordinates/invalid	0 allocs: 0 B	0 allocs: 0 B
Utils/hascoordinates/valid	0 allocs: 0 B	0 allocs: 0 B
Utils/list2matrix/upper	3 allocs: 1.91 MB	3 allocs: 1.91 MB	1
Utils/list2matrix/upper_diagonal	6 allocs: 2.86 MB	6 allocs: 2.86 MB	1
Utils/matrix2list/upper	3 allocs: 0.952 MB	3 allocs: 0.952 MB	1
Utils/matrix2list/upper_diagonal	3 allocs: 0.956 MB	3 allocs: 0.956 MB	1
time_to_load	0.149 k allocs: 11.1 kB	0.149 k allocs: 11.1 kB	1

coveralls · 2025-11-24T09:14:57Z

coverage: 97.156% (+0.02%) from 97.135%
when pulling a12a7e0 on get_n_words_perf
into 8548471 on master.

diegozea · 2025-11-24T15:44:11Z

Note: Benchmark CI shows only a modest micro-benchmark improvement for get_n_words and no visible speedup when reading large Stockholm files. The main benefit of this change is clearer behavior for empty/whitespace lines, enforcing the “n or fewer words” contract, and simplifying the implementation.

diegozea added 2 commits November 24, 2025 09:22

Update get_n_words to resize words vector based on actual word count

793bca6

Refactor get_n_words to improve handling of leading delimiters and add tests for edge cases

87b245b

diegozea added 2 commits November 24, 2025 10:28

Add tests for get_n_words to handle empty string cases

b736e34

Remove unused _find_next_space_or_tab function

a12a7e0

diegozea merged commit 04de5ee into master Nov 24, 2025
17 of 20 checks passed
