Fixed a transcoding bug that occurred when remote transcoder was removed #2747
Codecov Report
@@ Coverage Diff @@
## master #2747 +/- ##
===================================================
- Coverage 56.34825% 56.32772% -0.02053%
===================================================
Files 88 88
Lines 19147 19138 -9
===================================================
- Hits 10789 10780 -9
Misses 7767 7767
Partials 591 591
Works great, did many tests with removing and adding multiple Ts without any dropped streams.
This looks like a great simplification and as far as I can tell, the original code is just convoluted because of micro-optimisations that we don't need. Could you also please add a unit test so that we can avoid breaking this again in the future?
Yep! I will write some tests, was thinking the same. This is critical functionality. Thanks for looking it over 😎
Nice, great simplification.
+1 for adding unit tests.
Also wondering, what's the functional difference between the old and the new code? Isn't it working exactly the same, how does it fix the issue you encountered?
Already has a unit test BTW: Lines 372 to 420 in ab553b4
Good point @stronk-dev - I'd still want to update that test such that it would've caught the case that was breaking though
It appears some funky logic or race condition led to it. I couldn't follow the logic there. I discovered this after monitoring my own node. We then tested with real test streams until it was a confirmed bug, and then worked through the solution. The issue occurred only when there were 2 or more streams on the transcoder that was disconnected. I wrote a new test which reproduced the issue on the old version of Lines 423 to 477 in 707115e
In the test which uses 2 remote transcoders and 8 streams:
Added new test
@leszko @thomshutt Any chance this can make it into 0.5.38? #2753
core/orch_test.go
// register 2 transcoders, 8 stream capacity each
go func() { rtm.Manage(transcoder_1, 8, capabilities.ToNetCapabilities()) }()
time.Sleep(1 * time.Millisecond) // allow time for first stream to register
Won't this test be flaky? I think depending on timing (especially at the ms scale) can make this test sometimes pass and sometimes fail. I'd avoid having time-dependent tests.
I copied the wait commands from another test. I think they are not needed. I didn't see anything in my development that indicates they are required, so I can try removing them.
Yeah, then I suggest removing them if not needed. Also, do you need to run these functions above in separate goroutines?
Yes, apparently the separate goroutine is required in this case; without it, the test will time out. I tried removing the time.Sleep between them, but that caused the test to move on without any transcoder connected at all.
I pushed a commit to clean up a few related lines.
Hmm, TBH, in my opinion it's better not to have tests with time.Sleep(), because they will sometimes fail and sometimes pass in our CI. So, if it's not possible (or easy) to write a test without sleep, then I suggest removing this test and just merging without it. @thomshutt wdyt?
The improvement to the code is already good, so should be safe to merge.
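For reference, one common way to avoid time.Sleep in a test like this is to have the code under test signal when registration has happened and let the test block on that signal. The sketch below only illustrates the pattern: `manager`, `manage`, `registered` and `count` are hypothetical names, not go-livepeer's RemoteTranscoderManager API, and it assumes a registration signal can be exposed to the test at all.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical registry used only for this sketch.
type manager struct {
	mu         sync.Mutex
	live       map[string]int
	registered chan string // receives the name of each newly registered transcoder
}

func newManager() *manager {
	return &manager{live: make(map[string]int), registered: make(chan string, 16)}
}

// manage registers a transcoder, then blocks until quit is closed,
// which is why callers run it in its own goroutine.
func (m *manager) manage(name string, capacity int, quit <-chan struct{}) {
	m.mu.Lock()
	m.live[name] = capacity
	m.mu.Unlock()
	m.registered <- name // signal only after the state change is visible

	<-quit
	m.mu.Lock()
	delete(m.live, name)
	m.mu.Unlock()
}

// count returns the number of currently registered transcoders.
func (m *manager) count() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.live)
}

func main() {
	m := newManager()
	quit := make(chan struct{})
	var wg sync.WaitGroup

	for _, name := range []string{"t1", "t2"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			m.manage(n, 8, quit)
		}(name)
	}

	// Block until both registrations are signalled instead of sleeping.
	<-m.registered
	<-m.registered
	fmt.Println("registered:", m.count()) // prints "registered: 2"

	close(quit)
	wg.Wait()
	fmt.Println("after shutdown:", m.count()) // prints "after shutdown: 0"
}
```

The goroutines are still needed because `manage` blocks for the lifetime of the connection, but the test no longer depends on how long the scheduler takes to run them.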
@leszko @eliteprox I don't understand - if the problem is this method being called by multiple goroutines in an unsafe way then shouldn't the problem be fixed further up the stack? If the problem is inside
From my current understanding, there are 2 issues:
Then, I suggest doing the following:
Can we simply continue using the existing I've removed
I'm ok with this. I think this change is safe as it is and it simplifies the code a lot. @thomshutt wdyt?
LGTM
Let's wait for @thomshutt to comment/approve and then we can merge.
Been running this upgrade on prod for 4 days with no errors to report
LGTM, thanks for the fix and for responding to all our questions @eliteprox!
What does this pull request do? Explain your changes. (required)
This change fixes a bug that caused an orchestrator to lose all of its transcoders when one of them disconnects unexpectedly.
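As an illustration of the behaviour being described (remove exactly one transcoder and leave the rest untouched), here is a minimal sketch of removing a single entry from a slice by pointer identity. The `transcoder` type and `removeTranscoder` function are hypothetical stand-ins and should not be read as the code in this PR.

```go
package main

import "fmt"

// transcoder is a hypothetical stand-in for a remote transcoder entry.
type transcoder struct {
	addr string
}

// removeTranscoder returns a new slice without the entry that is
// pointer-identical to target. All other entries keep their order.
func removeTranscoder(target *transcoder, list []*transcoder) []*transcoder {
	kept := make([]*transcoder, 0, len(list))
	for _, t := range list {
		if t != target {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	t1 := &transcoder{addr: "t1"}
	t2 := &transcoder{addr: "t2"}
	t3 := &transcoder{addr: "t3"}

	// Removing t2 leaves t1 and t3 in place.
	remaining := removeTranscoder(t2, []*transcoder{t1, t2, t3})
	for _, t := range remaining {
		fmt.Println(t.addr)
	}
}
```

Building a fresh slice costs one allocation per removal, but it avoids in-place index arithmetic, which is easy to get wrong when the same entry can appear to be removed more than once.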
Specific updates (required)
removeFromRemoteTranscoders so that the correct transcoder is removed from RemoteTranscoderManager
How did you test each of these updates (required)
We tested up to 4 streams going to an orchestrator with up to 4 remote transcoders. We stopped one of the transcoders in the following scenarios and observed streams move to another transcoder without disconnecting any other transcoders.
In every case we observed the streams safely move to another available transcoder without causing other transcoders to break.
Does this pull request close any open issues?
#2605
#2706
Checklist:
make runs successfully
./test.sh pass