T does not automatically connect to O after restart or crash .35 and .36 #2704

Closed
AuthorityNull opened this issue Dec 29, 2022 · 1 comment · Fixed by #2705
Labels
status: triage this issue has not been evaluated yet

Comments

@AuthorityNull

AuthorityNull commented Dec 29, 2022

Describe the bug
This is potentially related to issue #2702.

Ts get stuck trying to reconnect to an O after the O restarts or crashes. This does not happen every time; it occurs most often when the O crashes or restarts uncleanly.

The T must be manually restarted before it can connect to the O again, or else it just hangs here:

Dec 29 23:30:18 fortuna livepeer[299586]: I1229 23:30:18.319760  299586 ot_rpc.go:141] End of stream receive cycle because of err="rpc error: code = Unavailable desc = error reading from server: EOF", waiting for running transcode jobs to complete

Thanks to Marco, we may have narrowed down the section of code where the issue is occurring:

	var wg sync.WaitGroup
	for {
		notify, err := r.Recv()
		if err := checkTranscoderError(err); err != nil {
			glog.Infof(`End of stream receive cycle because of err=%q, waiting for running transcode jobs to complete`, err)
			wg.Wait()
			return err
		}
		wg.Add(1)
		if notify.SegData != nil && notify.SegData.AuthToken != nil && len(notify.SegData.AuthToken.SessionId) > 0 && len(notify.Url) == 0 {
			// session teardown signal
			n.Transcoder.EndTranscodingSession(notify.SegData.AuthToken.SessionId)
		} else {
			go func() {
				runTranscode(n, orchAddr, httpc, notify)
				wg.Done()
			}()
		}
	}
}

To Reproduce
Steps to reproduce the behavior:

  1. Launch an O and T
  2. Restart the O, or simulate a crash.
  3. Observe the T's logs.
  4. See the T hang at the "End of stream receive cycle" message instead of reconnecting.

Expected behavior
The T should automatically reconnect to the O regardless of how the O restarts.

Desktop (please complete the following information):

  • OS: Linux/Ubuntu
  • Version .35 and .36
@github-actions github-actions bot added the status: triage this issue has not been evaluated yet label Dec 29, 2022
@stronk-dev
Contributor

Yea, my thinking is the bug got introduced in this commit: a1fb761#diff-0db7e4513a2e3eb16dedb22ed6a0920e6648e62733004c11dc0ed8a19f314464

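Seems to me like the wait group can end up with more wg.Add(1) than wg.Done() calls: wg.Add(1) runs on every notification, but when the loop enters the if statement where it calls EndTranscodingSession, no goroutine is started and wg.Done() is never called. The counter then never drops back to zero, so wg.Wait() blocks forever once the stream errors out.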

Currently testing if moving the wg.Add(1) into the else statement fixes the issue...
