Skip to content

Conversation

@michaelsembwever
Copy link
Member

https://github.com/riptano/cndb/issues/16028

Port into main-5.0 commit f5af71f

CNDB-16028: CNDB-15967: Add a reachability fence to Ref#release to prevent races between the state release and the reference reaper (#2125)

When releasing a `Ref`, we occasionally see this release fail as a "bad release", indicating that the ref has already been released. At apparently the exact same time (to the resolution of our logs), we see the reference-reaper report a leak of the same reference and release the reference. This happens most frequently in GlobalTidyConcurrencyTest, which intentionally stresses these subsystems. In particular, for this issue to occur, there must be no strong references to the ref outside the reference on the stack frame of the method calling `ref.release` _and_ no further usages of `ref` within that method.

The Ref `ref` has a `state` field, which refers to a `Ref$State` which is also a `PhantomReference` to `ref`. When `ref.release()` is called, it calls `state.release()`. During the execution of `state.release()` but prior to the usage of `releasedUpdater` to update the `released` field, the JIT may determine that `ref` will not be used and clear its reference on the stack. Then, the GC may determine that `ref` is only phantom reachable, clearing `state` as a `PhantomReference` and enqueuing it into `referenceQueue`. The `Reference-Reaper` may then take `state` from the `referenceQueue` and release it as a leak. This leak release can update the `released` field, causing it to report a leak. The thread executing `state.release()` may then resume, and when it attempts to update the `released` field it will fail and report a bad release. Ultimately, since it's a race between a correct release and a Reference-Reaper release, these are both spurious errors and the system continues to operate correctly.

To prevent this race, we must ensure that `ref` remains strongly reachable throughout the execution of `state.release()`. A reachability fence to itself will prevent the JIT from clearing the reference on the stack, keeping it strongly reachable.

This PR adds a reachability fence to `ref.release()` after the state release call. It also adds a byteman rule to occasionally induce GCs that can determine `ref` to be phantom reachable in `GlobalTidyConcurrencyTest`. This byteman rule causes the test to reliably fail without the reachability fence.

…event races between the state release and the reference reaper (#2125)

When releasing a `Ref`, we occasionally see this release fail as a "bad
release", indicating that the ref has already been released. At
apparently the exact same time (to the resolution of our logs), we see
the reference-reaper report a leak of the same reference and release the
reference. This happens most frequently in GlobalTidyConcurrencyTest,
which intentionally stresses these subsystems. In particular, for this
issue to occur, there must be no strong references to the ref outside
the reference on the stack frame of the method calling `ref.release`
_and_ no further usages of `ref` within that method.

The Ref `ref` has a `state` field, which refers to a `Ref$State` which
is also a `PhantomReference` to `ref`. When `ref.release()` is called,
it calls `state.release()`. During the execution of `state.release()`
but prior to the usage of `releasedUpdater` to update the `released`
field, the JIT may determine that `ref` will not be used and clear its
reference on the stack. Then, the GC may determine that `ref` is only
phantom reachable, clearing `state` as a `PhantomReference` and
enqueuing it into `referenceQueue`. The `Reference-Reaper` may then take
`state` from the `referenceQueue` and release it as a leak. This leak
release can update the `released` field, causing it to report a leak.
The thread executing `state.release()` may then resume, and when it
attempts to update the `released` field it will fail and report a bad
release. Ultimately, since it's a race between a correct release and a
Reference-Reaper release, these are both spurious errors and the system
continues to operate correctly.

To prevent this race, we must ensure that `ref` remains strongly
reachable throughout the execution of `state.release()`. A reachability
fence to itself will prevent the JIT from clearing the reference on the
stack, keeping it strongly reachable.

This PR adds a reachability fence to `ref.release()` after the state
release call. It also adds a byteman rule to occasionally induce GCs
that can determine `ref` to be phantom reachable in
`GlobalTidyConcurrencyTest`. This byteman rule causes the test to
reliably fail without the reachability fence.
@github-actions
Copy link

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@sonarqubecloud
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants