-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a Classic mod option to use stable's hit windows #26452
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basic first impressions pass before i even attempt to verify this against ground truth (which will likely take days so i don't want to spend too much time on that before getting the shape of this correct)
|
||
/// <summary> | ||
/// Retrieve a valid list of <see cref="DifficultyRange"/>s representing hit windows. | ||
/// Defaults are provided but can be overridden to customise for a ruleset. | ||
/// </summary> | ||
protected virtual DifficultyRange[] GetRanges() => base_ranges; | ||
|
||
public class LegacyHitWindows : HitWindows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no sense calling things "legacy" if they're just going to exist forevermore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really agree with this argument. I used this word because standard, taiko, catch and mania are called legacy rulesets in the codebase:
/// Check whether this <see cref="IRulesetInfo"/>'s online ID is within the range that defines it as a legacy ruleset (ie. either osu!, osu!taiko, osu!catch or osu!mania). |
And some kind of word is necessary (at least with the new class approach), because these hit windows would not be recommended to be used for new custom rulesets, and we're only keeping them to not diverge from stable. If you can think of a better name, feel free to recommend it (i also used
HistoricalHitWindows
before commiting).
/// <param name="difficulty">The difficulty parameter.</param> | ||
/// <param name="range">The range of difficulty values.</param> | ||
/// <returns>Value to which the difficulty parameter maps in the specified range.</returns> | ||
public virtual double ValueFor(double difficulty, DifficultyRange range) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- needs better name (
CalculateWindowFor()
?) - needs to not be public
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the name to a more descriptive one (HitWindowValueFor
). I suppose the reason why I left it public is that another class might want to know the value of the "true" hit window (for example, for display when hovering over OD in song select). I'll just make it protected for now though.
/// <param name="timeOffset">The time offset.</param> | ||
/// <param name="result">The <see cref="HitResult"/>.</param> | ||
/// <returns>Whether the time offset is contained within the hit window of the hit result.</returns> | ||
public virtual bool Contains(double timeOffset, HitResult result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is gonna be a separate method, then I'd argue there's a high chance that every single caller of WindowFor()
is incorrect at that point, because that method existing may cause someone to wrongly attempt to use WindowFor()
to determine if an offset is contained hit result window by locally (and wrongly) reimplementing Contains()
.
What I'm getting at is that either this should not exist if possible, or that other method should not be public anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored Contains
to the separate method to make it a single source of truth (a judgement inequality was used hardcoded in two places, and I almost forgot to change it in the other place, which would've resulted in a bug). Again, possibly in the future another class might want to know the answer to the question Contains
answers, but the visibility remark is valid, so I'll just make the method protected for now.
I generally do not enjoy the complexity incurred by this solution. If possible, I would like to just adjust the edges of each hitwindow for each ruleset (by rounding, subtracting 0.5ms, adding 0.5ms) to match stable and remove the weird quirks of the subclassing and the overrides and the rounding. Have you investigated if that is feasible? I know you posted this wall of text but I got lost halfway through reading that, somewhere around
I'm not sure what the issue is there and how it would be "resolved". The explanation is rather dense and appears to have a lot of mental shortcuts so it's difficult for me to follow it properly. I would appreciate an extended explanation of what that has to do with this PR. |
Thank you for the review. I've answered all the comments. The reason why I am not changing the default
It is not possible to match hit windows by just changing the ranges's numbers while keeping the default
TL;DR of that part:
Edit:
I see now that by this, you might have been referring to the sentence "After this issue is resolved (...)" in my issue comment. "This issue" refers to the OP there, not the legacy score encoding being incorrect. |
Realistically does this matter in any way? If the
Are you referring to the flooring of the hitwindow ranges in I'm really looking to simplify here if possible and so far I'd say that I see no clear and convincing argument for |
You're right that the equality only makes a difference at the very edge, but when the ranges numbers are the round ones from the stable OD formula, then that edge is an integer, just like legacy replays, which is an important case of the judgement differing. If you change the range values to remedy that, then you're not flooring properly. I don't see an alternative solution you have in mind that doesn't have either of these issues, and I also don't see what's wrong with subtracting in the place where we also floor, which we have to. I explained why I separated |
And I explained what my problem is with I dunno, seems like this is an impasse as the conversation is going in circles and I might have to approach this with a fresh mind from scratch myself. |
Would making |
That's still not solving the issue. The issue is that the numerical hitwindow ranges are publicly exposed - despite the fact that they are used by rulesets differently - and are free to be compared wrong by anyone. If implementations are going to differ in how comparisons against the hitwindow ranges are done, the raw values should not be exposed except if explicitly signposted that they are display values only and should not be used for... well anything really at this point probably? In other words: |
I see now. There's definitely a problem to resolve here. There's already code that hard-assumes that containment in a hit window is always judged using
I'd argue that the hard-coding of this assumption is the real problem, I'm thinking back to the idea of making Let's just try keeping the current Annoyingly, while trying to verify by watching replays from the OP (particularly, the standard DT one), I hit the issue of judgement counts not matching between rewatches. I presume it might be due to the fact that I seeked to the end (if I remember correctly, the result counts from the OP were noted after watching replays without seeking), and there seem to be open issues noting that already. I'll note that:
|
I've performed fairly extensive testing aiming to verify the matching of the proposed hit windows against the ground-truth. With this PR, all legacy "true" hit window values are half-integers, so matching integer hit errors in legacy replays is sufficient to prove fractional hit errors also match. A utility was created that automatically sends inputs no earlier than a given hit error in stable, and used to observe the behavior around the hit window edges on stable and then lazer. For example, I've verified that on a OD 10 osu!mania map, the behavior of a 121 ms (rounded) early hit (Meh) and a 122 ms early hit (Miss) matches on both stable and this PR: The matching of other hit windows across the rulesets is verified similarly: One disparity was found: in osu!mania on stable, late 50s are impossible, resulting in a miss; there is no such special case on lazer. This isn't mentioned in the lazer gameplay differences article nor Walavouchey's table in the post on the issue, however it is documented in this article on the wiki since this commit, which is said to be sourced from stable: If a replay of such a play from stable is played on lazer, the late hit would result in a Meh, not a Miss: This is likely the cause of the nomod mania replay in the OP having one extra Meh instead of a Miss in lazer, as there is one late Meh hit in the replay (it's not around the biggest combo of the play though, so it doesn't explain the max combo disparity). Unsure whether reimplementing this behavior is desirable or not, it seems unintuitive to me and I wasn't aware of it before. If not, then mania will be slightly easier on lazer than stable, and thus stable's scores's accuracy might be slightly worse unless replayed on lazer. Unsure whether this is acceptable.
Edit: Actually, it seems the DT play's hit result counts discrepancy might also be caused by the late Meh hit window. Edit 2: Interestingly, the mania late 100 hit window on stable is stricter than the other late hit windows, I presume the late 100 special case comparison for some reason uses an exclusive comparison instead of inclusive like all the other hit windows. |
The remaining mania replay hit result count mismatches are due to two disparities not related to legacy hit window values: Disparity 1Consider the third column from the right. One frame after the screenshot, a tap is made in it. On stable, the tap is registered on the lower note as a 200. On lazer, the tap is registered on the higher note as a Perfect, and the lower note is left behind to become a Miss. This gameplay change between stable and lazer was made intentionally as part of this discussion, added with this PR, and has been called 'mania note lock'. Disparity 2Consider the second column from the right. Before the screenshot was taken, the gray pink note was hit 220 ms early (1.5x less than that in real-time), and resulted in a Miss in both lazer and stable, as this value is well within the miss hit window. A tap has been made in the column around the time of taking the screenshot, 218 ms earlier than the start time of the pink note above the gray pink note. On stable, the early hit is forgiven, and a future third hit is allowed to successfully hit the note. On lazer, the pink note is considered a Miss, and the future third hit is redundant. Not 100% sure what the cause of this disparity is. I believe the 218 ms early hit is still before the gray pink note's time, so possibly the gray pink note ate the input for this reason, even though the note was already a Miss? The hit times likely mean that the cause of this disparity is not related to hit window values. Thus, they're outside the scope of this PR. After the above testing, I'm fairly confident legacy hit windows match correctly. There are two ways in which the legacy hit window formula might be used: ruleset-wide, or only as an option in Classic mod. The Classic-only approach seems preferable (more intuitive hit window values and formula, increased precision of the OD scale, preserves result counts of scores set on lazer). However, it would require fixing legacy score encoding. Currently, lazer is encoding hit times very wrongly around the hit window edges, resulting in a mismatch in hit result counts between lazer gameplay and replays. Regardless, I've changed the implementation of legacy hit windows in |
8702f58
to
af98eb7
Compare
@@ -301,7 +301,7 @@ public bool OnPressed(KeyBindingPressEvent<ManiaAction> e) | |||
// The tail has a lenience applied to it which is factored into the miss window (i.e. the miss judgement will be delayed). | |||
// But the hold cannot ever be started within the late-lenience window, so we should skip trying to begin the hold during that time. | |||
// Note: Unlike below, we use the tail's start time to determine the time offset. | |||
if (Time.Current > Tail.HitObject.StartTime && !Tail.HitObject.HitWindows.CanBeHit(Time.Current - Tail.HitObject.StartTime)) | |||
if (Time.Current > Tail.HitObject.StartTime && !Tail.HitObject.HitWindows.CanEverBeHit(Time.Current - Tail.HitObject.StartTime)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is happening here? why "can ever be hit"? as opposed to what, can "sometimes" be hit? what is the meaning of this?
you talk about opening "follow up PRs", yet i am very uncomfortable with the complexity already incurred by this change as is, and think there's something going deeply wrong if things are headed this way. i do not believe it is necessary complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a rename, the behavior of this method does not change with this PR. I found the name CanBeHit
confusing, as when I first saw it, I thought it would be
Math.Abs(timeOffset) <= WindowFor(LowestSuccessfulHitResult())
(timeOffset
contained within the hit window range),
but it's supposed to be
timeOffset <= WindowFor(LowestSuccessfulHitResult())
(always true for negative timeOffset
).
The phrase "can ever be hit" is taken from pre-PR XML docs.
Supersedes #26419. Resolves #11311.
Creates new classes,LegacyHitWindows
andInclusiveLegacyHitWindows
, which override methods that perform judgements, to be used by standard and taiko, and mania, respectively.The theoretical basis for the judgement formulas is explained here.The taiko miss hit window case is solved by adjusting the miss hit window value to one equivalent to stable's miss hit window. I deemed this much less complex than implementing a mechanism to perform separate checks for different hit results just to handle this case.Adds an option to Classic mod for osu!, taiko and mania to use legacy (floored and half-integer) hit window values.
TestSceneSliderLateHitJudgement.cs
had to be corrected - the test was failing because the hardcoded hit errors assumed too lenient hit windows.I've also corrected two tests which used generic
HitWindows
while testing a particular legacy ruleset. Both tests worked one way or the other, so this is a minor change.While testing mania, I've also noticed that there is another factor contributing to hit windows not matching stable. Mania uses
base_ranges
inHitWindows
for its hit windows. These ranges match stable, except for Perfect:osu/osu.Game/Rulesets/Scoring/HitWindows.cs
Line 20 in 70ba5dd
On stable, it would be (16, 16, 16). Apparently this was done intentionally,
however I corrected this, for the same reasons that this PR exists in the first place(it is desirable to not devalue or overvalue historical stable scores, and this would be a significant deviation). I think it would also be a good idea to move these ranges intoManiaHitWindows
, and set something more intuitive asbase_ranges
as the default for new rulesets, instead of imposing mania's legacy hit windows onto them. I don't see any reasons not to do that, sincebase_ranges
only seem to be used for mania in this codebase, and custom ruleset developers can always provide their own ranges if they dislike this change, so I've gone ahead and also done that. This brokeTestSceneGameplaySampleTriggerSource
which used the wrong hit object class and hardcoded a value used in the test. This was also corrected.Testing has been performed by watching a replay on stable, live lazer and lazer with the change applied. It's best if the replay is of a play from stable, not lazer, because current lazer's hit result counts aren't the desired ones due to hit windows not matching, so it's difficult to get a ground-value judgement counts list to compare with.
Comparison of stable replays result counts
standard and taiko
Two scores are used for standard, one with and without DT, to test speed rate
Conclusion: standard and taiko hit windows seem to work as expected in this test.
mania
Mania is trickier to test using stable replays for several reasons:
Fortunately there exists a ranked and not converted mania map with only short notes and fractional OD: link.
Note that live lazer's Perfect hit windows are different from stable's.
Conclusion: even after the changes, mania doesn't quite match stable.
On 100% speed rate, the hit windows seem to work well in this test, but not quite ideally. The last score in the table differs slightly from stable, and I'm not quite sure why. One note is judged as a Meh, instead of a Miss. The hit results match when the Meh hit window is made 5 ms stricter (but not 4.5 ms stricter). But I don't see why that would be the right thing to do. The max combo should match on short notes but it does not. This might be a hint. More research is required, possibly by analyzing the replay data.
Non-100% speed rate hit results don't match particularly well. The Perfect counts on the convert don't match. On the third score, new lazer matches stable much more closely than live lazer, but it's still quite far from perfectly matching. Something is not quite right with speed rates, and more research is required. This doesn't appear to be the fault of the change either, as live lazer with just the Perfect hit window change also doesn't result in the same Perfect result count as stable on the third score.
Closing
In general, stable seems more closely matched after this change, especially on standard and taiko.
The tests in the scene from the superseded PR now pass. All other automatic tests also pass. Feedback and more testing welcome. Since the testing is not completely exhaustive, it is not impossible that there exist corner cases that still are not matching. Possible future directions: