@diegozea
Owner

  • Reworked get_n_words in src/Utils/GeneralUtils.jl into a single-pass splitter that skips leading whitespace, tracks the last character boundary to avoid prevind overhead, and stops splitting after n-1 tokens so the nth entry still gathers the remainder. The function now returns only the words found (no trailing #undef/empty fields) and returns String[] for delimiter-only lines. Added a brief inline comment for the index bookkeeping.

  • Added coverage in test/Utils/GeneralUtils.jl for requests with too few words, trailing delimiters, and delimiter-only input to lock in the new behavior.
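
The splitting strategy described above can be sketched roughly as follows. This is an illustration, not the code merged in this PR: the name `first_n_words` is hypothetical, and the sketch assumes that `isspace` defines the delimiters and that trailing whitespace is stripped from the final field.

```julia
# Hypothetical sketch of a single-pass "first n words" splitter.
# Assumptions: delimiters are whatever `isspace` accepts, and the last
# field has its trailing whitespace stripped.
function first_n_words(line::AbstractString, n::Integer)
    words = String[]
    n < 1 && return words
    i = firstindex(line)
    stop = lastindex(line)
    while i <= stop
        # Skip the run of whitespace before the next word.
        while i <= stop && isspace(line[i])
            i = nextind(line, i)
        end
        i > stop && break  # only delimiters were left
        if length(words) == n - 1
            # Last slot: gather the remainder of the line in one field.
            push!(words, String(rstrip(SubString(line, i, stop))))
            break
        end
        # Scan the word while remembering the last valid character index,
        # so slicing needs no prevind call.
        word_start = i
        word_end = i
        while i <= stop && !isspace(line[i])
            word_end = i
            i = nextind(line, i)
        end
        push!(words, String(SubString(line, word_start, word_end)))
    end
    return words
end
```

With these assumptions, `first_n_words("#=GR O31698/18-71 SS    CCCHHH", 3)` yields three fields with the remainder kept together in the third, a line with fewer than `n` words yields only the words found, and a whitespace-only line yields `String[]`.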

Performance (BenchmarkTools):

  • Baseline (saved before changes) vs current measured with @benchmark get_n_words(line,3) and judge(new, base):

    • ASCII line: 79.48 ns (5 allocs, 208 B), 5.8% improvement
    • UTF-8 line: 65.42 ns (5 allocs, 224 B), 18.0% improvement
    • Allocation count is at the minimum (vector + returned strings), and whitespace-only lines now avoid allocating extra fields.
  • Profiled 200k calls with Profile.@profile; remaining hotspots are the unavoidable string slicing/allocations, confirming that the loop/prevind overhead was removed.

@codecov

codecov bot commented Nov 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.97%. Comparing base (8548471) to head (a12a7e0).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
+ Coverage   96.95%   96.97%   +0.02%     
==========================================
  Files          64       64              
  Lines        4860     4861       +1     
==========================================
+ Hits         4712     4714       +2     
+ Misses        148      147       -1     

@github-actions
Contributor

github-actions bot commented Nov 24, 2025

Benchmark Results (Julia v1)

Time benchmarks

| Benchmark | master | a12a7e0... | master / a12a7e0... |
|---|---|---|---|
| Information/CorrectedMutualInformation/buslje09/msa | 0.879 ± 0.0086 s | 0.878 ± 0.0094 s | 1 ± 0.015 |
| Information/CorrectedMutualInformation/buslje09/msa_large | 0.0362 ± 0.00013 s | 0.0362 ± 0.0001 s | 1 ± 0.0045 |
| Information/CorrectedMutualInformation/buslje09/msa_wide | 0.784 ± 0.008 s | 0.784 ± 0.0042 s | 1 ± 0.012 |
| Information/MIp/PF09645 | 9.57 ± 0.035 ms | 9.45 ± 0.032 ms | 1.01 ± 0.0051 |
| Information/frequencies!/1 | 0.28 ± 0.021 μs | 0.281 ± 0.029 μs | 0.996 ± 0.13 |
| Information/frequencies!/2 | 1.44 ± 0.02 μs | 1.44 ± 0.02 μs | 1 ± 0.02 |
| Information/highlevel/BLMI | 0.0645 ± 0.00026 s | 0.0643 ± 0.00017 s | 1 ± 0.0048 |
| Information/highlevel/buslje09 | 11.3 ± 0.05 ms | 11.3 ± 0.053 ms | 1 ± 0.0065 |
| Information/shannon_entropy/PF09645 | 19.7 ± 0.47 μs | 19.7 ± 0.4 μs | 1 ± 0.031 |
| MSA/Annotations/filtercolumns/boolean mask | 9.83 ± 0.18 μs | 9.92 ± 0.19 μs | 0.991 ± 0.027 |
| MSA/Annotations/filtercolumns/index array | 3.51 ± 0.08 μs | 3.52 ± 0.09 μs | 0.997 ± 0.034 |
| MSA/Base.vcat/annotated | 4.42 ± 0.28 μs | 4.35 ± 0.25 μs | 1.02 ± 0.087 |
| MSA/Base.vcat/unannotated | 1.51 ± 0.12 μs | 1.48 ± 0.12 μs | 1.02 ± 0.12 |
| MSA/Residue conversions/char2res | 0.351 ± 0.76 ms | 0.35 ± 0.64 ms | 1 ± 2.8 |
| MSA/Residue conversions/int2res | 0.191 ± 0.08 ms | 0.191 ± 0.085 ms | 1 ± 0.61 |
| MSA/Residue conversions/res2char | 0.26 ± 0.013 ms | 0.26 ± 0.013 ms | 0.999 ± 0.071 |
| MSA/Residue conversions/res2int | 0.183 ± 0.12 ms | 0.18 ± 0.13 ms | 1.01 ± 0.97 |
| MSA/hobohmI/pid62 | 0.511 ± 0.03 μs | 0.511 ± 0.022 μs | 1 ± 0.073 |
| MSA/identity/matrix_Float64 | 17.6 ± 0.51 μs | 17.5 ± 0.44 μs | 1.01 ± 0.039 |
| MSA/identity/mean | 0.0878 ± 0.022 ms | 0.0884 ± 0.022 ms | 0.993 ± 0.35 |
| MSA/read/FASTA.gz | 0.0468 ± 0.0094 ms | 0.0464 ± 0.0088 ms | 1.01 ± 0.28 |
| MSA/read/FASTA.gz_annotated | 0.0497 ± 0.0016 ms | 0.0507 ± 0.0019 ms | 0.981 ± 0.048 |
| MSA/read/FASTA_deletefullgaps | 6.95 ± 0.37 ms | 6.99 ± 1.1 ms | 0.995 ± 0.17 |
| MSA/read/FASTA_deletefullgaps_mapping | 0.0916 ± 0.0054 s | 0.0928 ± 0.0049 s | 0.987 ± 0.078 |
| MSA/read/Stockholm | 0.0339 ± 0.0029 ms | 0.0339 ± 0.0023 ms | 0.999 ± 0.11 |
| MSA/read/Stockholm_annotated | 0.0456 ± 0.0086 ms | 0.0458 ± 0.0084 ms | 0.995 ± 0.26 |
| MSA/read/Stockholm_mapping | 0.185 ± 0.016 ms | 0.185 ± 0.016 ms | 1 ± 0.12 |
| MSA/read/Stockholm_mapping_coords | 0.0995 ± 0.028 ms | 0.1 ± 0.028 ms | 0.994 ± 0.39 |
| MSA/write/FASTA | 0.231 ± 0.038 ms | 0.22 ± 0.041 ms | 1.05 ± 0.26 |
| PDB/_generate_interaction_keys/defaults | 29.6 ± 17 μs | 29.3 ± 17 μs | 1.01 ± 0.83 |
| PDB/_get_matched_Cαs/hemoglobin | 0.0339 ± 0.0082 ms | 0.0345 ± 0.0088 ms | 0.983 ± 0.35 |
| PDB/_pdbresidues_to_mmcifdict/2vqc | 0.601 ± 0.023 ms | 0.58 ± 0.029 ms | 1.04 ± 0.066 |
| PDB/contact/1CBN_20_30_CB | 0.2 ± 0.001 μs | 0.201 ± 0.01 μs | 0.995 ± 0.05 |
| PDB/contact/1CBN_20_30_heavy | 0.251 ± 0.001 μs | 0.251 ± 0.011 μs | 1 ± 0.044 |
| PDB/count_alanine/1CBN | 0.331 ± 0.011 μs | 0.331 ± 0.01 μs | 1 ± 0.045 |
| PDB/distance/1CBN_20_30 | 0.14 ± 0.01 μs | 0.14 ± 0.001 μs | 1 ± 0.072 |
| PDB/read/MMCIFFile | 2.93 ± 0.041 ms | 2.94 ± 0.055 ms | 0.997 ± 0.023 |
| PDB/squared_distance/1CBN_20_30_CB | 0.21 ± 0.001 μs | 0.201 ± 0.01 μs | 1.04 ± 0.052 |
| PDB/squared_distance/1CBN_20_30_heavy | 0.251 ± 0.01 μs | 0.25 ± 0.01 μs | 1 ± 0.057 |
| Pfam/accession mapping/acc2seqnames | 0.19 ± 0.0099 ms | 0.187 ± 0.0094 ms | 1.02 ± 0.074 |
| SIFTS/ResidueDetails/_get_details | 2.94 ± 0.86 μs | 1.86 ± 0.57 μs | 1.58 ± 0.67 |
| SIFTS/ResidueDetails/_is_missing | 3.08 ± 0.84 μs | 1.99 ± 0.54 μs | 1.54 ± 0.59 |
| SIFTS/SIFTSResidue/18gs | 0.1 ± 0.01 μs | 0.1 ± 0.01 μs | 1 ± 0.14 |
| SIFTS/siftsmapping/2vqc | 2.25 ± 0.055 ms | 2.22 ± 0.048 ms | 1.01 ± 0.033 |
| Utils/get_n_words/ascii | 0.15 ± 0.01 μs | 0.14 ± 0.01 μs | 1.07 ± 0.1 |
| Utils/get_n_words/utf8 | 0.14 ± 0.01 μs | 0.12 ± 0.001 μs | 1.17 ± 0.084 |
| Utils/hascoordinates/invalid | 0.08 ± 0.01 μs | 0.08 ± 0.001 μs | 1 ± 0.13 |
| Utils/hascoordinates/valid | 0.13 ± 0.001 μs | 0.13 ± 0.001 μs | 1 ± 0.011 |
| Utils/list2matrix/upper | 0.178 ± 0.075 ms | 0.174 ± 0.043 ms | 1.02 ± 0.5 |
| Utils/list2matrix/upper_diagonal | 0.263 ± 0.096 ms | 0.267 ± 0.09 ms | 0.984 ± 0.49 |
| Utils/matrix2list/upper | 0.0712 ± 0.011 ms | 0.0737 ± 0.01 ms | 0.966 ± 0.2 |
| Utils/matrix2list/upper_diagonal | 0.075 ± 0.014 ms | 0.0756 ± 0.011 ms | 0.991 ± 0.23 |
| time_to_load | 0.785 ± 0.0039 s | 0.792 ± 0.0055 s | 0.992 ± 0.0084 |

Memory benchmarks

| Benchmark | master | a12a7e0... | master / a12a7e0... |
|---|---|---|---|
| Information/CorrectedMutualInformation/buslje09/msa | 0.766 M allocs: 0.032 GB | 0.766 M allocs: 0.032 GB | 1 |
| Information/CorrectedMutualInformation/buslje09/msa_large | 0.0901 M allocs: 5.03 MB | 0.0901 M allocs: 5.03 MB | 1 |
| Information/CorrectedMutualInformation/buslje09/msa_wide | 0.742 M allocs: 30.3 MB | 0.742 M allocs: 30.3 MB | 1 |
| Information/MIp/PF09645 | 20.3 k allocs: 0.819 MB | 20.3 k allocs: 0.819 MB | 1 |
| Information/frequencies!/1 | 0 allocs: 0 B | 0 allocs: 0 B | |
| Information/frequencies!/2 | 0 allocs: 0 B | 0 allocs: 0 B | |
| Information/highlevel/BLMI | 19.9 k allocs: 1.19 MB | 19.9 k allocs: 1.19 MB | 1 |
| Information/highlevel/buslje09 | 0.0377 M allocs: 2.3 MB | 0.0377 M allocs: 2.3 MB | 1 |
| Information/shannon_entropy/PF09645 | 0.047 k allocs: 12.2 kB | 0.047 k allocs: 12.2 kB | 1 |
| MSA/Annotations/filtercolumns/boolean mask | 18 allocs: 5.22 kB | 18 allocs: 5.22 kB | 1 |
| MSA/Annotations/filtercolumns/index array | 16 allocs: 1.62 kB | 16 allocs: 1.62 kB | 1 |
| MSA/Base.vcat/annotated | 0.143 k allocs: 6.58 kB | 0.143 k allocs: 6.58 kB | 1 |
| MSA/Base.vcat/unannotated | 0.064 k allocs: 2.7 kB | 0.064 k allocs: 2.7 kB | 1 |
| MSA/Residue conversions/char2res | 3 allocs: 4.1 MB | 3 allocs: 4.1 MB | 1 |
| MSA/Residue conversions/int2res | 3 allocs: 4.1 MB | 3 allocs: 4.1 MB | 1 |
| MSA/Residue conversions/res2char | 3 allocs: 2.05 MB | 3 allocs: 2.05 MB | 1 |
| MSA/Residue conversions/res2int | 3 allocs: 4.1 MB | 3 allocs: 4.1 MB | 1 |
| MSA/hobohmI/pid62 | 31 allocs: 1.77 kB | 31 allocs: 1.77 kB | 1 |
| MSA/identity/matrix_Float64 | 0.249 k allocs: 11.8 kB | 0.249 k allocs: 11.8 kB | 1 |
| MSA/identity/mean | 1.23 k allocs: 0.0517 MB | 1.23 k allocs: 0.0517 MB | 1 |
| MSA/read/FASTA.gz | 0.443 k allocs: 0.0752 MB | 0.443 k allocs: 0.0752 MB | 1 |
| MSA/read/FASTA.gz_annotated | 0.536 k allocs: 0.0794 MB | 0.533 k allocs: 0.0793 MB | 1 |
| MSA/read/FASTA_deletefullgaps | 13.6 k allocs: 17.4 MB | 13.6 k allocs: 17.4 MB | 1 |
| MSA/read/FASTA_deletefullgaps_mapping | 1.64 M allocs: 0.0795 GB | 1.64 M allocs: 0.0795 GB | 1 |
| MSA/read/Stockholm | 0.402 k allocs: 0.033 MB | 0.402 k allocs: 0.033 MB | 1 |
| MSA/read/Stockholm_annotated | 0.559 k allocs: 0.0413 MB | 0.559 k allocs: 0.0413 MB | 1 |
| MSA/read/Stockholm_mapping | 2.08 k allocs: 0.104 MB | 2.08 k allocs: 0.104 MB | 1 |
| MSA/read/Stockholm_mapping_coords | 1.64 k allocs: 0.0812 MB | 1.64 k allocs: 0.0812 MB | 1 |
| MSA/write/FASTA | 0.303 k allocs: 14.1 kB | 0.303 k allocs: 14.1 kB | 1 |
| PDB/_generate_interaction_keys/defaults | 0.497 k allocs: 0.0581 MB | 0.497 k allocs: 0.0581 MB | 1 |
| PDB/_get_matched_Cαs/hemoglobin | 0.584 k allocs: 0.0438 MB | 0.584 k allocs: 0.0438 MB | 1 |
| PDB/_pdbresidues_to_mmcifdict/2vqc | 8.56 k allocs: 1.12 MB | 8.56 k allocs: 1.12 MB | 1 |
| PDB/contact/1CBN_20_30_CB | 4 allocs: 0.281 kB | 4 allocs: 0.281 kB | 1 |
| PDB/contact/1CBN_20_30_heavy | 4 allocs: 0.281 kB | 4 allocs: 0.281 kB | 1 |
| PDB/count_alanine/1CBN | 0 allocs: 0 B | 0 allocs: 0 B | |
| PDB/distance/1CBN_20_30 | 0 allocs: 0 B | 0 allocs: 0 B | |
| PDB/read/MMCIFFile | 0.039 M allocs: 2.9 MB | 0.039 M allocs: 2.9 MB | 1 |
| PDB/squared_distance/1CBN_20_30_CB | 4 allocs: 0.281 kB | 4 allocs: 0.281 kB | 1 |
| PDB/squared_distance/1CBN_20_30_heavy | 4 allocs: 0.281 kB | 4 allocs: 0.281 kB | 1 |
| Pfam/accession mapping/acc2seqnames | 4.32 k allocs: 0.319 MB | 4.32 k allocs: 0.319 MB | 1 |
| SIFTS/ResidueDetails/_get_details | 25 allocs: 1.45 kB | 25 allocs: 1.45 kB | 1 |
| SIFTS/ResidueDetails/_is_missing | 25 allocs: 1.45 kB | 25 allocs: 1.45 kB | 1 |
| SIFTS/SIFTSResidue/18gs | 4 allocs: 0.125 kB | 4 allocs: 0.125 kB | 1 |
| SIFTS/siftsmapping/2vqc | 5.94 k allocs: 0.88 MB | 5.94 k allocs: 0.88 MB | 1 |
| Utils/get_n_words/ascii | 5 allocs: 0.203 kB | 5 allocs: 0.203 kB | 1 |
| Utils/get_n_words/utf8 | 5 allocs: 0.219 kB | 5 allocs: 0.219 kB | 1 |
| Utils/hascoordinates/invalid | 0 allocs: 0 B | 0 allocs: 0 B | |
| Utils/hascoordinates/valid | 0 allocs: 0 B | 0 allocs: 0 B | |
| Utils/list2matrix/upper | 3 allocs: 1.91 MB | 3 allocs: 1.91 MB | 1 |
| Utils/list2matrix/upper_diagonal | 6 allocs: 2.86 MB | 6 allocs: 2.86 MB | 1 |
| Utils/matrix2list/upper | 3 allocs: 0.952 MB | 3 allocs: 0.952 MB | 1 |
| Utils/matrix2list/upper_diagonal | 3 allocs: 0.956 MB | 3 allocs: 0.956 MB | 1 |
| time_to_load | 0.149 k allocs: 11.1 kB | 0.149 k allocs: 11.1 kB | 1 |

@coveralls

coveralls commented Nov 24, 2025

Coverage Status

coverage: 97.156% (+0.02%) from 97.135%
when pulling a12a7e0 on get_n_words_perf
into 8548471 on master.

@diegozea
Owner Author

Note: Benchmark CI shows only a modest micro-benchmark improvement for get_n_words and no visible speedup when reading large Stockholm files. The main benefits of this change are clearer behavior for empty/whitespace lines, enforcement of the “n or fewer words” contract, and a simpler implementation.

@diegozea diegozea merged commit 04de5ee into master Nov 24, 2025
17 of 20 checks passed