Conversation

@tgross
Member

@tgross tgross commented Sep 5, 2025

During a large-volume dispatch load test, I discovered that a lot of the total scheduling time is being spent calling structs.ParsePortRanges repeatedly to parse the node's reserved ports configuration (e.g., converting "80,8000-8001" to []int{80, 8000, 8001}). A close examination of the profiles shows that the bulk of the time is spent hashing the keys for the map of ports we use for de-duplication, and then sorting the resulting slice.

The (*NetworkIndex) SetNode method that calls the offending ParsePortRanges merges all the ports into the UsedPorts map of bitmaps at scheduling time, which means the consumer of the slice is already de-duplicating and doesn't care about the order. The only other caller of ParsePortRanges is the configuration file validation, which throws away the slice entirely.
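
To illustrate why a bitmap consumer makes sorting and de-duplication redundant, here is a minimal sketch. The portBitmap type and its methods are hypothetical stand-ins written for this example; they are not Nomad's actual Bitmap or NetworkIndex code, and the sketch assumes every port is below 65536.

```go
package main

import (
	"fmt"
	"math/bits"
)

// portBitmap is a hypothetical stand-in for one of the per-address bitmaps
// in UsedPorts: one bit per possible port number (0-65535).
type portBitmap [65536 / 64]uint64

// set marks a port as used. Setting an already-set bit changes nothing, so
// duplicate ports in the input are harmless, and order is irrelevant.
// Assumes port < 65536; larger values would index out of range.
func (b *portBitmap) set(port uint64) {
	b[port/64] |= 1 << (port % 64)
}

// check reports whether a port is marked as used.
func (b *portBitmap) check(port uint64) bool {
	return b[port/64]&(1<<(port%64)) != 0
}

// count returns the number of distinct ports marked as used.
func (b *portBitmap) count() int {
	n := 0
	for _, word := range b {
		n += bits.OnesCount64(word)
	}
	return n
}

func main() {
	var used portBitmap
	// Unsorted input with duplicates, as an unsorted parser might return it.
	for _, p := range []uint64{8001, 80, 8000, 80, 8001} {
		used.set(p)
	}
	fmt.Println(used.count(), used.check(80), used.check(81)) // prints: 3 true false
}
```

Because setting a bit is idempotent, the bitmap ends up in the same state whether or not the input slice was sorted or de-duplicated first.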

By skipping de-duplication and not sorting, we can cut down the runtime of this function by 30x and memory usage by 4x.
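
As a rough sketch of the idea (not the code in this changeset), a parser that appends ports in input order with no de-duplication map and no final sort could look like the following. The function name parsePortSpec, its error messages, and its acceptance of port 0 are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePortSpec is a hypothetical, simplified parser. It expands a spec such
// as "80,8000-8001" into a slice of ports, appending in input order. It keeps
// no map of seen ports and does not sort, so the result may contain
// duplicates and is not ordered.
func parsePortSpec(spec string) ([]uint64, error) {
	var ports []uint64
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		lo, hi, isRange := strings.Cut(part, "-")
		// bitSize 16 rejects anything above 65535.
		start, err := strconv.ParseUint(lo, 10, 16)
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %w", lo, err)
		}
		end := start
		if isRange {
			if end, err = strconv.ParseUint(hi, 10, 16); err != nil {
				return nil, fmt.Errorf("invalid port %q: %w", hi, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("invalid range %q: end before start", part)
		}
		for p := start; p <= end; p++ {
			ports = append(ports, p)
		}
	}
	return ports, nil
}

func main() {
	ports, err := parsePortSpec("8000-8001,80,80")
	fmt.Println(ports, err) // prints: [8000 8001 80 80] <nil>
}
```

The duplicate 80 and the unsorted order are left for the consumer to absorb, which is safe only when the consumer is something like a bitmap.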

See my comment here for why memoizing the result proved to be impractical. This changeset also deletes the unused ParseReservedHostPorts method and moves its tests into the tests for ParsePortRanges (which is really what it was testing anyway).

Ref: https://github.com/hashicorp/nomad/blob/v1.10.4/nomad/structs/network.go#L201
Fixes: #26654


I'm going to reproduce the test I described in #26654, but in the meantime, here's a microbenchmark:

func BenchmarkParsePortRangesOld(b *testing.B) {
	spec := "22,8000-9000"
	for b.Loop() {
		ParsePortRanges(spec)
	}
}

func BenchmarkParsePortRangesNew(b *testing.B) {
	spec := "22,8000-9000"
	for b.Loop() {
		ParsePortRangesNew(spec)
	}
}

$ go test -v -count=1 ./nomad/structs -benchmem -bench BenchmarkParsePortRanges -run=^#
goos: linux
goarch: amd64
pkg: github.com/hashicorp/nomad/nomad/structs
cpu: Intel(R) Core(TM) Ultra 7 165H
BenchmarkParsePortRangesOld
BenchmarkParsePortRangesOld-22              7041            149205 ns/op           99616 B/op         37 allocs/op
BenchmarkParsePortRangesNew
BenchmarkParsePortRangesNew-22            233440              4926 ns/op           25288 B/op         15 allocs/op
PASS
ok      github.com/hashicorp/nomad/nomad/structs        2.214s

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@tgross tgross force-pushed the NMD942-no-sorting-on-parseportranges branch from 9f41577 to cab99d3 Compare September 5, 2025 20:16
@tgross tgross changed the title networking: don't sort reserved port ranges before adding to bitmap scheduler: don't sort reserved port ranges before adding to bitmap Sep 5, 2025
@tgross tgross added theme/networking backport/1.10.x backport to 1.10.x release line labels Sep 5, 2025
@tgross (Member Author) commented on the changelog entry (release-note:improvement):

Note for reviewers: I'm open to calling this a bug so that we can backport it. Seems like a nice win for ENT customers on the LTS.

Member

+1 to backporting

During a large-volume dispatch load test, I discovered that a lot of the total
scheduling time is being spent calling `structs.ParsePortRanges` repeatedly to
parse the node's reserved ports configuration (e.g., converting
`"80,8000-8001"` to `[]int{80, 8000, 8001}`). A close examination of the
profiles shows that the bulk of the time is spent hashing the keys for the
map of ports we use for de-duplication, and then sorting the resulting slice.

The `(*NetworkIndex) SetNode` method that calls the offending `ParsePortRanges`
merges all the ports into the `UsedPorts` map of bitmaps at scheduling time,
which means the consumer of the slice is already de-duplicating and doesn't
care about the order. The only other caller of `ParsePortRanges` is the
configuration file validation, which throws away the slice entirely.

By skipping de-duplication and not sorting, we can cut down the runtime of this
function by 30x and memory usage by about 4x.

Ref: https://github.com/hashicorp/nomad/blob/v1.10.4/nomad/structs/network.go#L201
Fixes: #26654
schmichael
schmichael previously approved these changes Sep 5, 2025
Member

@schmichael schmichael left a comment

🎉

chrisroberts
chrisroberts previously approved these changes Sep 5, 2025
Member

@chrisroberts chrisroberts left a comment

This is great. Might be nice to call out that the result may now include duplicates in the function doc.

@tgross
Member Author

tgross commented Sep 8, 2025

Going to move this back into draft, as I realized there's a minor DoS opportunity here with this approach because a misconfigured node could have a range like "1-65536,1-65536,1-65536,1-65536,1-65536,1-65536,..." and blow up the memory usage. Protecting against this kind of thing isn't part of our current security model (we don't assume totally untrusted nodes), but there are plenty of folks who do things like use text templating to generate configuration and could mess that up. The approach I'm working on both limits the total number of ports and parses them lazily using an iterator.

@tgross tgross marked this pull request as draft September 8, 2025 13:39
@tgross
Member Author

tgross commented Sep 8, 2025

Using an iterator greatly improved the performance of this function... but didn't actually help further reduce CPU/memory usage because the caller needs to apply the results of the iterator multiple times over different bitmaps (at least once for each interface). I also tried a reusable iterator, but that ended up needing to memoize the []uint64 slice anyway, so it added a lot of complexity for zero improvement in performance over my original approach here.

So I've updated with the docstring comments requested and added a guard on the total number of ports, as well as found an opportunity to use slices.Grow to be more efficient with allocations. That feels about as good as we're going to get with this round of changes.
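
A minimal sketch of what a guard on the total number of ports combined with slices.Grow might look like. The appendRange helper, the errTooManyPorts error, and the maxPorts value are all assumptions for illustration; the limit and structure actually used in the changeset are not shown in this conversation.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// maxPorts is an assumed cap on the total number of parsed ports.
const maxPorts = 65536

var errTooManyPorts = errors.New("port spec expands to too many ports")

// appendRange appends start..end (inclusive) to ports, refusing to let the
// slice exceed maxPorts entries in total.
func appendRange(ports []uint64, start, end uint64) ([]uint64, error) {
	if end < start {
		return ports, fmt.Errorf("invalid range %d-%d", start, end)
	}
	width := end - start // cannot underflow because end >= start
	if width >= maxPorts || uint64(len(ports))+width+1 > maxPorts {
		return ports, errTooManyPorts
	}
	// Reserve capacity once so the appends below do not reallocate.
	ports = slices.Grow(ports, int(width)+1)
	for i := uint64(0); i <= width; i++ {
		ports = append(ports, start+i)
	}
	return ports, nil
}

func main() {
	var ports []uint64
	var err error
	// A templating mistake repeats the full range; the second copy is rejected.
	for i := 0; i < 2 && err == nil; i++ {
		ports, err = appendRange(ports, 1, 65535)
	}
	fmt.Println(len(ports), err) // prints: 65535 port spec expands to too many ports
}
```

Checking the size before growing means a repeated range like the one described earlier fails fast instead of allocating memory proportional to the number of repetitions.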

I'm also going to follow up with a second PR that removes the COMPAT(0.11) sections from the SetNode call.

@tgross tgross added backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.9.x+ent Changes are backported to 1.9.x+ent labels Sep 8, 2025
Member

@gulducat gulducat left a comment

lgtm!

@tgross
Member Author

tgross commented Sep 8, 2025

I ran a similar high dispatch load test to the one I originally ran for #26654 (with a larger instance size because that's what I had handy).

Without this patch:

[Image: follower-cpu-profile-before]

With this patch:

[Image: follower-cpu-profile-after]

@tgross tgross merged commit f86a141 into main Sep 8, 2025
46 checks passed
@tgross tgross deleted the NMD942-no-sorting-on-parseportranges branch September 8, 2025 16:05
@tgross tgross added this to the 1.10.x milestone Sep 8, 2025

Labels

backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/ent/1.9.x+ent Changes are backported to 1.9.x+ent backport/1.10.x backport to 1.10.x release line theme/networking theme/scheduling type/enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scheduler calls expensive structs.ParsePortRanges repeatedly

5 participants