RHELMISC-7214: Try to rerun CI when queue_test failed with crash with [RFC] #612
base: master
The title and the content of the change do not match.
Why does RToolsHCK#list_test_results return nil instead of raising an error? It sounds like a bug.
This bug is on Microsoft's side, and I can't reproduce it locally.
I think it is a bug in rToolsHCK.
When we look into logs with this error, we can see
Just a small example in Ruby:
3.3.7 :001 > x = [1,2,3]
=> [1, 2, 3]
3.3.7 :002 > x.select { |t| t > 1 }
=> [2, 3]
3.3.7 :003 > x.select { |t| t > 5 }
=> []
What do you think?
@Jedoku I don't agree with your solution. Tools#list_test_results is just a wrapper, and it should not decide what to do when data is provided without any error. Currently,
The fix for this issue should be in the Tests class. The Tests class knows that the test was queued, so the results MUST exist. If the results are missing in this case, it means
Your fix does nothing about this problem.
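The IRB session above shows that `select` returns an empty array, not `nil`, when nothing matches. A minimal sketch of why the two cases deserve separate handling follows; `latest_result` is a hypothetical helper name, and the `'instanceid'` key is taken from the code discussed in this thread. On an empty array, `max_by` returns `nil`, so indexing the result with `['status']` would raise `NoMethodError` unless the empty case is guarded.

```ruby
# Hypothetical helper (not part of rtoolsHCK or AutoHCK): pick the newest
# result, treating both "nil" and "empty list" as "no result available".
def latest_result(results)
  return nil if results.nil? || results.empty?

  # Same selection rule as in the thread: highest numeric instance id wins.
  results.max_by { |r| r['instanceid'].to_i }
end
```

Whether "no result" should then be reported as `nil` or as an exception is exactly the open question in this conversation; the sketch only makes the empty case explicit.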
The solution looks like this:
def wait_queued_test(id)
  # Add timeout to avoid infinite loop ? (5 minutes)
  # Need to check existing logs to see timeout cases
  loop do
    sleep 5
    results = @tools.list_test_results(id, @target['key'], @client.name, @tag)
    # No results at all: tell the caller to queue the test again
    return false if results.nil?

    last_result = results.max_by { |k| k['instanceid'].to_i }
    check_test_queued_time
    return true if last_result['status'] == 'InQueue'
    return true if last_result['status'] == 'Running'
    return true if test_finished?(last_result)
  end
end

def queue_test(test, wait: false)
  # Queue the test up to 5 times until results become visible
  5.times do
    @tools.queue_test(test_id: test['id'],
                      target_key: @target['key'],
                      machine: @client.name,
                      tag: @tag,
                      support: test_support(test),
                      parameters: test_parameters(test['name']))
    @tests_extra[test['id']] ||= {}
    @tests_extra[test['id']]['queued_at'] = DateTime.now
    @last_queued_id = test['id']
    return unless wait
    return if wait_queued_test(test['id'])
  end
  raise "Failed to queue test #{test['name']} after 5 attempts"
end
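The comment in `wait_queued_test` asks whether a timeout should be added to avoid an infinite loop. One way this could look is sketched below, assuming a monotonic-clock deadline; `poll_until` and `PollTimeoutError` are hypothetical names introduced only for illustration and are not part of the project.

```ruby
# Hypothetical error raised when the deadline passes without a result.
class PollTimeoutError < StandardError; end

# Hypothetical helper: call the block every `interval` seconds until it
# returns a truthy value, or raise once `timeout` seconds have elapsed.
def poll_until(timeout:, interval:)
  # A monotonic clock is not affected by wall-clock adjustments.
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    value = yield
    return value if value

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise PollTimeoutError, "condition not met within #{timeout}s"
    end

    sleep interval
  end
end
```

With the values mentioned in the thread, a call might use `timeout: 300` and `interval: 5`, with the block returning the last result once its status is acceptable.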
@kostyanf14
In either case, a correct fix would be to raise an error and trigger a retry.
No. In logs, we see 5 times (the tools already implement a retry mechanism)
this is from https://github.com/HCK-CI/rtoolsHCK/blob/master/tools/toolsHCK.ps1#L1820
Oh, I see. Raising an exception in such a case is questionable behavior, but at least the exception should be delivered to the user instead of returning
Are you talking about the
This is done I am not sure that raising exceptions will help us fix the real issue.
I referred to the
It's better to change it to raise an exception. Returning
Ok.
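A rough sketch of the "raise an exception, then retry" approach discussed above. Every name here (`MissingTestResultsError`, `fetch_results!`, `with_retries`) is hypothetical, and `tools` stands for any object with a `list_test_results` method; its signature is simplified to a single argument for illustration and does not match the real call.

```ruby
# Hypothetical error class signalling that a queued test has no results.
class MissingTestResultsError < StandardError; end

# Hypothetical wrapper: surface a missing result as an exception
# instead of passing nil on to the caller.
def fetch_results!(tools, id)
  results = tools.list_test_results(id)
  raise MissingTestResultsError, "no results returned for test #{id}" if results.nil?

  results
end

# Hypothetical retry helper: re-run the block while it raises `error_class`,
# re-raising the last error once `attempts` tries have been used.
def with_retries(attempts:, error_class: StandardError)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue error_class
    retry if tries < attempts
    raise
  end
end
```

The caller decides how many attempts are reasonable, and the final failure carries a message instead of a silent `nil`.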
Can we split this fix into 2 PRs:
What do you think?
I have no idea. Usually this kind of bug should be investigated by adding more logs; it would be nice if HLK had some kind of operation log, but I'm not sure HLK has such a feature.
I think we can skip the quick fix and just do 2. This crash is hard to reproduce (i.e., not frequent), so I guess we don't need to hurry to make a quick fix.
By the way, I think this pull request should be closed. Apparently
Unfortunately no, but we can just try to dump all the HLK data that we got and then try to analyze it. |
No description provided.