Skip to content

Conversation

@FGasper
Copy link
Collaborator

@FGasper FGasper commented Dec 9, 2025

Verifier historically considered all mismatches coequally significant. For logging this is unhelpful: ideally users should see those mismatches that are recurring in successive generations. This changeset takes a step toward that by tracking how long a given mismatch has appeared in Verifier checks and surfacing the “longest-lived” mismatches in the log.

To do this, two pieces of information are added when tracking a mismatch:

  • when it was first seen
  • how long the most recent occurrence happened after that first sight (i.e., the mismatch’s “duration”)

These get saved into both the mismatch (for logging purposes) and the recheck entry.

When Verifier creates a new task from recheck entries, that task also contains the relevant document’s mismatch times (first-seen time & duration). If any non-mismatch rechecks exist for the document, the mismatch times are omitted from the task. This is because those non-mismatch rechecks will have come from change events, which means the document has changed, so any further mismatch will be a “new” one. (In this case, of course, hopefully the change event means the mismatch is now fixed!)

If, on recheck, the document is still mismatched, the mismatch’s duration is recomputed (i.e., the time since the mismatch’s first-seen time), and that new duration is saved with the new recheck entry. (If there’s no mismatch, then there’s no recheck.)

Finally: when logging mismatches, Verifier now also shows the mismatch’s duration and sorts the displayed mismatches longest-duration-first. To offset this additional “noise” in the logs, Verifier no longer shows destination namespaces in document mismatch logs.

This changeset also includes some tangential-but-opportune logging improvements:

  • BUG FIX: Verifier no longer tells people to query the tasks collection. Instead, the status line refers to the documentation (which contains an aggregation command for this).
  • BUG FIX: Verifier no longer logs context cancellation errors when its logging is interrupted.
  • Logs now distinguish missing-on-dst documents from extra-on-dst. Previously the logs labeled both scenarios as “missing”, which was always confusing.
  • Log entries’ final status lines are now more nuanced. If all shown logs are 0s in duration, for example, the status line no longer says simply, “Mismatches found”. Instead Verifier now says that all mismatches are “new” and will (hopefully) be resolved.
  • Some headers are now abbreviated & made more consistent.

Ancillary, implementation-level changes:

  • getDocIdFromComparison() now asserts rather than ignoring errors. (It should never fail.)
  • This includes a new library to make writing MongoDB aggregations far more ergonomic in Go.

@FGasper FGasper changed the title Rep 6882 track mismatch length REP-6882 Track mismatch duration Dec 9, 2025
@FGasper FGasper requested a review from autarch December 9, 2025 16:04
@FGasper FGasper marked this pull request as ready for review December 9, 2025 16:04
Copy link
Collaborator

@autarch autarch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very big PR, and I'm really struggling to follow it.

I think this would be a lot easier to review if split this PR up a bit. I think there are a few things it could be broken into:

  1. Add the aggregation stuff and use it.
  2. Rename "mismatch" -> "problem".
  3. Add the mismatch time stuff.
  4. Maybe separate out the changes in output too, but if 2 and 3 were split up, this may not be a big deal.

I'm not sure if it makes sense to split out 1, since I don't think you want to write new aggregations by hand and then use the new aggregation code.

In particular, combining 2 & 3 makes this much harder to review, since there's a lot of places where the name of a thing changes and/or there's a logic change. And it's hard to see whether there is a logic change.

This would be a good use of Graphite. I'd note that you could probably make full use of Graphite and merge stacks all together, since this repo doesn't use Evergreen.

Anyway, overall, this looks reasonable, inasmuch as I can follow it. Is there anyone else on the team who knows this code better than me? Maybe they can take a look? If not, I really think this needs to be broken up.

return problems, docCount, byteCount, err
}

func (verifier *Verifier) compareDocsFromChannels(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is very hard to follow. It's a massive method that mostly consists of closures. It would be great to break this up. I know you didn't write this originally, but you are adding yet another closure here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t write the tool itself, but the code in question probably is, in fact, my work. (In my defense, this tool always seems to be “maintenance-only … until uh-oh it now needs to be better.”)

I split the new stuff out into a new struct, in a separate file. I think a broad-scale refactor would help, but it’d be more ideal as a follow-up.

Comment on lines 128 to 140
var idToMismatch map[string]recheck.MismatchTimes
getPriorMismatchTimes := func(doc bson.Raw) option.Option[recheck.MismatchTimes] {
if idToMismatch == nil {
idToMismatch = createIdToMismatchTimes(task)
}

mapKeyBytes := rvToMapKey(
nil,
getDocIdFromComparison(verifier.docCompareMethod, doc),
)

return option.IfNotZero(idToMismatch[string(mapKeyBytes)])
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a closure? It seems like you could make this a separate func without too much trouble.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s a separate struct now.

pool.Put(srcDoc.doc)
pool.Put(dstDoc.doc)
var mismatchTimes option.Option[recheck.MismatchTimes]
if len(curProblems) > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like if len(curProblems) == 0 you could just short-circuit and return nil instead of checking it here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 56 to 57
// TODO

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forget to add or delete something?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to delete the comment. :-P fixed

@FGasper FGasper force-pushed the REP-6882-track-mismatch-length branch 2 times, most recently from 2ba4f93 to 0f5079a Compare December 11, 2025 19:14
@FGasper FGasper force-pushed the REP-6882-track-mismatch-length branch from 2cb8576 to 4ac6c1e Compare December 11, 2025 19:59
@FGasper
Copy link
Collaborator Author

FGasper commented Dec 11, 2025

@autarch Thank you for your notes.

I’ve done what I think I reasonably can here to derisk the review:

  • removed the rename
  • split out agg & bson changes to separate commits
  • reduced the callback-hell in compare.go

… but it’s still a large changeset, regrettably. (I hoped also to separate the logging changes, but there didn’t seem a viable “middle ground”.)

I did make other changes since you looked:

  • Added an arch doc on mismatch handling. (Hopefully this helps!)
  • Added more tests for the new logic.
  • The recheck & task docs now store only the first-mismatch time. (They don’t need the mismatch duration.)

@FGasper FGasper requested review from autarch and tdq45gj December 11, 2025 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants