T does not automatically connect to O after restart or crash .35 and .36 #2704

AuthorityNull · 2022-12-29T22:41:31Z

Describe the bug
This is potentially related to issue #2702.

Ts get stuck trying to reconnect to an O after the O restarts or crashes. This does not happen every time, but it most commonly occurs when an O crashes or restarts uncleanly.

The T must be manually restarted before it can connect to the O again, or else it just hangs here:

Dec 29 23:30:18 fortuna livepeer[299586]: I1229 23:30:18.319760  299586 ot_rpc.go:141] End of stream receive cycle because of err="rpc error: code = Unavailable desc = error reading from server: EOF", waiting for running transcode jobs to complete

Thanks to Marco, we may have narrowed down the section of code where the issue is occurring:

go-livepeer/server/ot_rpc.go

Lines 124 to 143 in e2c46a1

    
    	var wg sync.WaitGroup
    	for {
    		notify, err := r.Recv()
    		if err := checkTranscoderError(err); err != nil {
    			glog.Infof(`End of stream receive cycle because of err=%q, waiting for running transcode jobs to complete`, err)
    			wg.Wait()
    			return err
    		}
    		wg.Add(1)
    		if notify.SegData != nil && notify.SegData.AuthToken != nil && len(notify.SegData.AuthToken.SessionId) > 0 && len(notify.Url) == 0 {
    			// session teardown signal
    			n.Transcoder.EndTranscodingSession(notify.SegData.AuthToken.SessionId)
    		} else {
    			go func() {
    				runTranscode(n, orchAddr, httpc, notify)
    				wg.Done()
    			}()
    		}
    	}
    }

To Reproduce
Steps to reproduce the behavior:

1. Launch an O and a T.
2. Restart the O, or simulate a crash.
3. Observe the T's logs.
4. See the error.

Expected behavior
T should automatically connect back to an O regardless of how an O restarts.

Desktop (please complete the following information):

OS: Linux/Ubuntu
Version .35 and .36


stronk-dev · 2022-12-30T09:45:10Z

Yea, my thinking is the bug got introduced in this commit: a1fb761#diff-0db7e4513a2e3eb16dedb22ed6a0920e6648e62733004c11dc0ed8a19f314464

Seems to me like the wait group can end up with more wg.Add(1) than wg.Done() calls when it enters the if branch that calls EndTranscodingSession. That branch never calls wg.Done(), so wg.Wait() would block forever once the stream ends.

Currently testing if moving the wg.Add(1) into the else statement fixes the issue...

github-actions bot added the status: triage this issue has not been evaluated yet label Dec 29, 2022

stronk-dev mentioned this issue Dec 30, 2022

Fix: transcoders wait forever on orchestrator restart #2705

Merged

cyberj0g closed this as completed in #2705 Jan 6, 2023
